14 Aug, 2019

15 commits

  • Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS
    when testing kernel v5.2.3. This is caused by a race between hugetlb
    page migration and page fault.

    If a hugetlb page can not be allocated to satisfy a page fault, the task
    is sent SIGBUS. This is normal hugetlbfs behavior. A hugetlb fault
    mutex exists to prevent two tasks from trying to instantiate the same
    page. This protects against the situation where there is only one
    hugetlb page, and both tasks would try to allocate. Without the mutex,
    one would fail and SIGBUS even though the other fault would be
    successful.

    There is a similar race between hugetlb page migration and fault.
    Migration code will allocate a page for the target of the migration. It
    will then unmap the original page from all page tables. It does this
    unmap by first clearing the pte and then writing a migration entry. The
    page table lock is held for the duration of this clear and write
    operation. However, the beginning of the hugetlb page fault code
    optimistically checks the pte without taking the page table lock. If the
    pte is clear (as it can be during the migration unmap operation), a hugetlb
    page allocation is attempted to satisfy the fault. Note that the page
    which will eventually satisfy this fault was already allocated by the
    migration code. However, the allocation within the fault path could
    fail which would result in the task incorrectly being sent SIGBUS.

    Ideally, we could take the hugetlb fault mutex in the migration code
    when modifying the page tables. However, locks must be taken in the
    order of hugetlb fault mutex, page lock, page table lock. This would
    require significant rework of the migration code. Instead, the issue is
    addressed in the hugetlb fault code. After failing to allocate a huge
    page, take the page table lock and check for huge_pte_none before
    returning an error. This is the same check that must be made further in
    the code even if page allocation is successful.
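
    A hedged sketch of the added re-check in the hugetlb no-page fault path
    (not the literal upstream diff; the surrounding variables follow
    mm/hugetlb.c conventions):

        page = alloc_huge_page(vma, haddr, 0);
        if (IS_ERR(page)) {
                /*
                 * Before sending SIGBUS, take the page table lock and make
                 * sure there really is no pte entry; migration may have
                 * cleared the pte transiently and already holds the page.
                 */
                ptl = huge_pte_lock(h, mm, ptep);
                if (!huge_pte_none(huge_ptep_get(ptep))) {
                        ret = 0;                /* retry the fault, no SIGBUS */
                        spin_unlock(ptl);
                        goto out;
                }
                spin_unlock(ptl);
                ret = vmf_error(PTR_ERR(page)); /* genuine allocation failure */
                goto out;
        }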

    Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
    Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
    Signed-off-by: Mike Kravetz
    Reported-by: Li Wang
    Tested-by: Li Wang
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Cyril Hrubis
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
    ("mm: reclaim small amounts of memory when an external fragmentation
    event occurs").

    The report is extensive:

    https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/

    and it's worth recording the most relevant parts (colorful language and
    typos included).

    When running a simple, steady state 4kB file creation test to
    simulate extracting tarballs larger than memory full of small
    files into the filesystem, I noticed that once memory fills up
    the cache balance goes to hell.

    The workload is creating one dirty cached inode for every dirty
    page, both of which should require a single IO each to clean and
    reclaim, and creation of inodes is throttled by the rate at which
    dirty writeback runs at (via balance dirty pages). Hence the ingest
    rate of new cached inodes and page cache pages is identical and
    steady. As a result, memory reclaim should quickly find a steady
    balance between page cache and inode caches.

    The moment memory fills, the page cache is reclaimed at a much
    faster rate than the inode cache, and evidence suggests that
    the inode cache shrinker is not being called when large batches
    of pages are being reclaimed. In roughly the same time period
    that it takes to fill memory with 50% pages and 50% slab caches,
    memory reclaim reduces the page cache down to just dirty pages
    and slab caches fill the entirety of memory.

    The LRU is largely full of dirty pages, and we're getting spikes
    of random writeback from memory reclaim so it's all going to shit.
    Behaviour never recovers, the page cache remains pinned at just
    dirty pages, and nothing I could tune would make any difference.
    vfs_cache_pressure makes no difference - I would set it so high
    it should trim the entire inode caches in a single pass, yet it
    didn't do anything. It was clear from tracing and live telemetry
    that the shrinkers were pretty much not running except when
    there was absolutely no memory free at all, and then they did
    the minimum necessary to free memory to make progress.

    So I went looking at the code, trying to find places where pages
    got reclaimed and the shrinkers weren't called. There's only one
    - kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
    reclaim small amounts of memory when an external fragmentation
    event occurs").

    The watermark boosting introduced by the commit is triggered in response
    to an allocation "fragmentation event". The boosting was not intended
    to target THP specifically and triggers even if THP is disabled.
    However, with Dave's perfectly reasonable workload, fragmentation events
    can be very common given the ratio of slab to page cache allocations so
    boosting remains active for long periods of time.

    As high-order allocations might use compaction and compaction cannot
    move slab pages, the decision was made in the commit to special-case
    kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
    reclaiming slab does not directly help compaction.

    As Dave notes, this decision means that slab can be artificially
    protected for long periods of time and messes up the balance between
    slab and page caches.

    Removing the special casing can still indirectly help avoid
    fragmentation by avoiding fragmentation-causing events due to slab
    allocation as pages from a slab pageblock will have some slab objects
    freed. Furthermore, with the special casing, reclaim behaviour is
    unpredictable as kswapd sometimes examines slab and sometimes does not
    in a manner that is tricky to tune or analyse.

    This patch removes the special casing. The downside is that this is not
    a universal performance win. Some benchmarks that depend on the
    residency of data when rereading metadata may see a regression when slab
    reclaim is restored to its original behaviour. Similarly, some
    benchmarks that only read-once or write-once may perform better when
    page reclaim is too aggressive. The primary upside is that the slab
    shrinker is less surprising (arguably more sane, but that's a matter of
    opinion), behaves consistently regardless of the fragmentation state of
    the system and properly obeys VM sysctls.
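
    A hedged sketch of the change in shrink_node() (the may_shrinkslab flag
    name is assumed from the 5.2/5.3-era scan_control; this is not the
    literal upstream diff):

        shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
        node_lru_pages += lru_pages;

        /*
         * Previously: if (sc->may_shrinkslab) shrink_slab(...);
         * where may_shrinkslab was false while the watermark boost was
         * active. With the special case removed, slab is always considered:
         */
        shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);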

    A fsmark benchmark configuration was constructed similar to what Dave
    reported and is codified by the mmtest configuration
    config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
    machine to avoid dealing with NUMA-related issues and the timing of
    reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS
    filesystem was used for the test data.

    This is not an exact replication of Dave's setup. The configuration
    scales its parameters depending on the memory size of the SUT to behave
    similarly across machines. The parameters mean the first sample
    reported by fs_mark is using 50% of RAM which will barely be throttled
    and look like a big outlier. Dave used fake NUMA to have multiple
    kswapd instances, which I didn't replicate. Finally, the number of
    iterations differs from Dave's test as the target disk was not large
    enough. While not identical, it should be representative.

    fsmark
    5.3.0-rc3 5.3.0-rc3
    vanilla shrinker-v1r1
    Min 1-files/sec 4444.80 ( 0.00%) 4765.60 ( 7.22%)
    1st-qrtle 1-files/sec 5005.10 ( 0.00%) 5091.70 ( 1.73%)
    2nd-qrtle 1-files/sec 4917.80 ( 0.00%) 4855.60 ( -1.26%)
    3rd-qrtle 1-files/sec 4667.40 ( 0.00%) 4831.20 ( 3.51%)
    Max-1 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
    Max-5 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
    Max-10 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
    Max-90 1-files/sec 4649.60 ( 0.00%) 4780.70 ( 2.82%)
    Max-95 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%)
    Max-99 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%)
    Max 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
    Hmean 1-files/sec 5004.75 ( 0.00%) 5075.96 ( 1.42%)
    Stddev 1-files/sec 1778.70 ( 0.00%) 1369.66 ( 23.00%)
    CoeffVar 1-files/sec 33.70 ( 0.00%) 26.05 ( 22.71%)
    BHmean-99 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%)
    BHmean-95 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%)
    BHmean-90 1-files/sec 5107.05 ( 0.00%) 5131.41 ( 0.48%)
    BHmean-75 1-files/sec 5208.45 ( 0.00%) 5206.68 ( -0.03%)
    BHmean-50 1-files/sec 5405.53 ( 0.00%) 5381.62 ( -0.44%)
    BHmean-25 1-files/sec 6179.75 ( 0.00%) 6095.14 ( -1.37%)

                          5.3.0-rc3      5.3.0-rc3
                            vanilla  shrinker-v1r1
    Duration User            501.82         497.29
    Duration System         4401.44        4424.08
    Duration Elapsed        8124.76        8358.05

    This shows a slight skew for the max result, which represents a large
    outlier, while the 1st, 2nd and 3rd quartiles are similar, indicating
    that the bulk of the results show little difference. Note that an
    earlier version of the fsmark configuration showed a regression but
    that included more samples taken while memory was still filling.

    Note that the elapsed time is higher. Part of this is that the
    configuration included time to delete all the test files when the test
    completes -- the test automation handles the possibility of testing
    fsmark with multiple thread counts. Without the patch, many of these
    objects would be memory resident which is part of what the patch is
    addressing.

    There are other important observations that justify the patch.

    1. With the vanilla kernel, the number of dirty pages in the system is
    very low for much of the test. With this patch, dirty pages is
    generally kept at 10% which matches vm.dirty_background_ratio which
    is normal expected historical behaviour.

    2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
    0.95 for much of the test i.e. Slab is being left alone and
    dominating memory consumption. With the patch applied, the ratio
    varies between 0.35 and 0.45 with the bulk of the measured ratios
    roughly half way between those values. This is a different balance to
    what Dave reported but it was at least consistent.

    3. Slabs are scanned throughout the entire test with the patch applied.
    The vanilla kernel has periods with no scan activity and then
    relatively massive spikes.

    4. Without the patch, kswapd scan rates are very variable. With the
    patch, the scan rates remain quite steady.

    5. Overall vmstats are closer to normal expectations

    5.3.0-rc3 5.3.0-rc3
    vanilla shrinker-v1r1
    Ops Direct pages scanned 99388.00 328410.00
    Ops Kswapd pages scanned 45382917.00 33451026.00
    Ops Kswapd pages reclaimed 30869570.00 25239655.00
    Ops Direct pages reclaimed 74131.00 5830.00
    Ops Kswapd efficiency % 68.02 75.45
    Ops Kswapd velocity 5585.75 4002.25
    Ops Page reclaim immediate 1179721.00 430927.00
    Ops Slabs scanned 62367361.00 73581394.00
    Ops Direct inode steals 2103.00 1002.00
    Ops Kswapd inode steals 570180.00 5183206.00

    o Vanilla kernel is hitting direct reclaim more frequently,
    not very much in absolute terms but the fact the patch
    reduces it is interesting
    o "Page reclaim immediate" in the vanilla kernel indicates
    dirty pages are being encountered at the tail of the LRU.
    This is generally bad and means in this case that the LRU
    is not long enough for dirty pages to be cleaned by the
    background flush in time. This is much reduced by the
    patch.
    o With the patch, kswapd is reclaiming 10 times more slab
    pages than with the vanilla kernel. This is indicative
    of the watermark boosting over-protecting slab

    A more complete set of tests were run that were part of the basis for
    introducing boosting and while there are some differences, they are well
    within tolerances.

    Bottom line: special-casing kswapd to avoid reclaiming slab makes
    behaviour unpredictable and can lead to abnormal results for normal
    workloads.

    This patch restores the expected behaviour that slab and page cache are
    balanced consistently for a workload with a steady allocation ratio of
    slab/pagecache pages. It also means that workloads which favour the
    preservation of slab over pagecache can tune this via
    vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
    the parameter when boosting is active.

    Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Mel Gorman
    Reviewed-by: Dave Chinner
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This reverts commit 2f0799a0ffc033b ("mm, thp: restore node-local
    hugepage allocations").

    Commit 2f0799a0ffc033b was rightfully applied to avoid the risk of a
    severe regression that was reported by the kernel test robot at the end
    of the merge window. We now understand that the regression was a false
    positive, caused by a significant increase in fairness during a swap
    thrashing benchmark. So it's safe to re-apply the fix and continue
    improving the code from there. The benchmark that reported the
    regression is very useful, but it provides a meaningful result only when
    there is no significant alteration in fairness during the workload. The
    removal of __GFP_THISNODE increased fairness.

    __GFP_THISNODE cannot be used in the generic page faults path for new
    memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
    behavior significantly deviates from what the MPOL_DEFAULT semantics are
    supposed to be for THP and 4k allocations alike.

    Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
    set to "madvise") was never meant to provide an implicit MPOL_BIND on
    the "current" node the task is running on, causing swap storms and
    providing a much more aggressive behavior than even zone_reclaim_mode =
    3.

    Any workload that could have benefited from __GFP_THISNODE now has to
    enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
    the zone_reclaim_mode behavior, but it only did so if THP was enabled:
    if THP was disabled, there would have been no chance to get any 4k page
    from the current node if the current node was full of pagecache, which
    further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
    MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
    semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
    must work exactly the same with MADV_HUGEPAGE set or not.

    The performance characteristic of memory depends on the hardware
    details. The numbers below are obtained on Naples/EPYC architecture and
    the N/A projection extends them to show what we should aim for in the
    future as a good THP NUMA locality default. The benchmark used
    exercises random memory seeks (note: the cost of the page faults is not
    part of the measurement).

    D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
    0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A

    D0 means distance zero (i.e. local memory), D1 means distance one (i.e.
    intra socket memory), D2 means distance two (i.e. inter socket memory),
    etc...

    For the guest physical memory allocated by qemu and for guest mode
    kernel the performance characteristic of RAM is more complex and an
    ideal default could be:

    D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
    0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A

    NOTE: the N/A are projections and haven't been measured yet, the
    measurement in this case is done on a 1950x with only two NUMA nodes.
    The THP case here means THP was used both in the host and in the guest.

    After applying this commit the THP NUMA locality order that we'll get
    out of MADV_HUGEPAGE is this:

    D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

    Before this commit it was:

    D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

    Even if we ignore the breakage of large workloads that can't fit in a
    single node that the __GFP_THISNODE implicit "current node" mbind
    caused, the THP NUMA locality order provided by __GFP_THISNODE was still
    not the one we shall aim for in the long term (i.e. the first one at
    the top).

    After this commit is applied, we can introduce a new multi order
    allocator API and replace the two alloc_pages_vma calls in the page
    fault path with a single multi order call:

    unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
    page = alloc_pages_multi_order(..., &order);
    if (!page)
            goto out;
    if (!(order & (1 << 0))) {
            VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
            /* THP fault */
    } else {
            VM_WARN_ON(order != 1 << 0);
            /* 4k fallback */
    }

    The page allocator logic has to be altered so that when it fails on any
    zone with order 9, it tries again with order 0 on that zone before
    falling back to the next zone in the zonelist.
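
    A possible shape for that fallback inside the zonelist walk (purely
    illustrative pseudo-code for the proposed behaviour; try_zone_alloc()
    and order_mask are hypothetical names, only the iterator macro is an
    existing kernel API):

        for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
                /* try the huge order on this zone first ... */
                page = try_zone_alloc(zone, HPAGE_PMD_ORDER, gfp);
                if (page) {
                        *order_mask = 1 << HPAGE_PMD_ORDER;
                        return page;
                }
                /* ... then retry the same zone with order 0 ... */
                page = try_zone_alloc(zone, 0, gfp);
                if (page) {
                        *order_mask = 1 << 0;
                        return page;
                }
                /* ... before falling back to the next zone */
        }
        return NULL;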

    After that we need to do more measurements and evaluate if adding an
    opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
    with "DN+1 THP | DN 4k" at every NUMA distance crossing.

    Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Zi Yan
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

    The fixes for what was originally reported as "pathological THP
    behavior" were rightfully reverted to be sure not to introduce
    regressions at the end of a merge window after a severe regression
    report from the kernel bot. We can safely re-apply them now that we
    have had time to analyze the problem.

    The mm process worked fine, because the good fixes were eventually
    committed upstream without excessive delay.

    The regression reported by the kernel bot however forced us to revert
    the good fixes to be sure not to introduce regressions and to give us
    the time to analyze the issue further. The silver lining is that this
    extra time allowed us to think more about the issue and also to plan
    for a future direction to improve things further in terms of THP NUMA
    locality.

    This patch (of 2):

    This reverts commit 356ff8a9a78fb35d ("Revert "mm, thp: consolidate THP
    gfp handling into alloc_hugepage_direct_gfpmask"). So it reapplies
    89c83fb539f954 ("mm, thp: consolidate THP gfp handling into
    alloc_hugepage_direct_gfpmask").

    Consolidating the THP allocation flags in one place was meant to be a
    cleanup that makes it easier to handle otherwise scattered code which
    imposes a maintenance burden. There were no real problems observed
    with the gfp mask consolidation, but the reversion was rushed through
    without a larger consensus regardless.

    This patch brings the consolidation back because it should make long
    term maintenance easier as well as allow future changes to be less
    error prone.

    [mhocko@kernel.org: changelog additions]
    Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Zi Yan
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Memcg counters for shadow nodes are broken because the memcg pointer is
    obtained in the wrong way. The following approach is used:
    virt_to_page(xa_node)->mem_cgroup

    Since commit 4d96ba353075 ("mm: memcg/slab: stop setting
    page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
    set for slab pages, so memcg_from_slab_page() should be used instead.

    Also I doubt that it ever worked correctly: virt_to_head_page() should
    be used instead of virt_to_page(). Otherwise objects residing on tail
    pages are not accounted, because only the head page contains a valid
    mem_cgroup pointer. That has been the case since the introduction of these
    counters by the commit 68d48e6a2df5 ("mm: workingset: add vmstat counter
    for shadow nodes").

    Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently, when checking to see if accessing n bytes starting at address
    "ptr" will cause a wraparound in the memory addresses, the check in
    check_bogus_address() adds an extra byte, which is incorrect, as the
    range of addresses that will be accessed is [ptr, ptr + (n - 1)].

    This can lead to incorrectly detecting a wraparound in the memory
    address, when trying to read 4 KB from memory that is mapped to the
    last possible page in the virtual address space, when in fact, accessing
    that range of memory would not cause a wraparound to occur.

    Use the memory range that will actually be accessed when considering if
    accessing a certain amount of bytes will cause the memory address to
    wrap around.
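
    A hedged sketch of the corrected bound (the real helper in mm/usercopy.c
    takes more arguments and aborts instead of returning; only the bound is
    the point here):

        /* The span accessed is [ptr, ptr + n - 1], so test that for wraparound. */
        if (ptr + (n - 1) < ptr)
                usercopy_abort("wrapped address", NULL, to_user, 0, ptr + (n - 1));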

    Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org
    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Signed-off-by: Prasad Sodagudi
    Signed-off-by: Isaac J. Manjarres
    Co-developed-by: Prasad Sodagudi
    Reviewed-by: William Kucharski
    Acked-by: Kees Cook
    Cc: Greg Kroah-Hartman
    Cc: Trilok Soni
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Isaac J. Manjarres
     
  • If an error occurs during kmemleak_init() (e.g. kmem cache cannot be
    created), kmemleak is disabled but kmemleak_early_log remains enabled.
    Subsequently, when the .init.text section is freed, the log_early()
    function no longer exists. To avoid a page fault in such a scenario,
    ensure that kmemleak_disable() also disables early logging.
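
    A hedged sketch of the intent (not the literal diff): the disable path
    also turns off early logging so nothing calls into the freed .init.text.

        static void kmemleak_disable(void)
        {
                ...
                /* stop any memory operation tracing */
                kmemleak_enabled = 0;
                kmemleak_early_log = 0;    /* added: no more log_early() calls */
                ...
        }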

    Link: http://lkml.kernel.org/r/20190731152302.42073-1-catalin.marinas@arm.com
    Signed-off-by: Catalin Marinas
    Reported-by: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Recent changes to the vmalloc code by commit 68ad4a330433
    ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
    cause spurious percpu allocation failures. These, in turn, can result
    in panic()s in the slub code. One such possible panic was reported by
    Dave Hansen at the following link: https://lkml.org/lkml/2019/6/19/939.
    Another related panic observed is:

    RIP: 0033:0x7f46f7441b9b
    Call Trace:
    dump_stack+0x61/0x80
    pcpu_alloc.cold.30+0x22/0x4f
    mem_cgroup_css_alloc+0x110/0x650
    cgroup_apply_control_enable+0x133/0x330
    cgroup_mkdir+0x41b/0x500
    kernfs_iop_mkdir+0x5a/0x90
    vfs_mkdir+0x102/0x1b0
    do_mkdirat+0x7d/0xf0
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
    to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
    uses two lists (vmap_area_list & free_vmap_area_list) to track the used
    and free VM areas in VMALLOC space. The pcpu_get_vm_areas(offsets[],
    sizes[], nr_vms, align) function is used for allocating congruent VM
    areas for the percpu memory allocator. In order not to conflict with
    VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
    VMALLOC space. So the search for a free vm_area for the given
    requirement starts near VMALLOC_END and moves upwards towards
    VMALLOC_START.

    Prior to commit 68ad4a330433, the search for a free vm_area in
    pcpu_get_vm_areas() involved the following two main steps.

    Step 1:
        Find an aligned "base" address near VMALLOC_END.
        va = free vm area near VMALLOC_END
    Step 2:
        Loop through number of requested vm_areas and check,
        Step 2.1:
            if (base < VMALLOC_START)
                1. fail with error
        Step 2.2:
            // end is offsets[area] + sizes[area]
            if (base + end > va->vm_end)
                1. Move the base downwards and repeat Step 2
        Step 2.3:
            if (base + start < va->vm_start)
                1. Move to previous free vm_area node, find aligned
                   base address and repeat Step 2

    But commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

    Step 2.3:
        if (base + start < va->vm_start || base + end > va->vm_end)
            1. Move to previous free vm_area node, find aligned
               base address and repeat Step 2

    The above change is the root cause of the spurious percpu memory
    allocation failures. For example, consider a case where a relatively
    large vm_area (~ 30 TB) was ignored in the free vm_area search because
    it did not pass the base + end < va->vm_end boundary check. Ignoring
    such large free vm_areas would lead to not finding a free vm_area within
    the boundary of VMALLOC_START to VMALLOC_END, which in turn leads to
    allocation failures.

    So modify the search algorithm to include Step 2.2.
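
    A hedged sketch of the restored check inside the pcpu_get_vm_areas()
    placement loop (field and helper names assumed from the 5.3-era
    mm/vmalloc.c; not the literal diff):

        /* Step 2.2: area does not fit above; slide the base downwards */
        if (base + end > va->va_end) {
                base = pvm_determine_end_from_reverse(&va, align) - end;
                term_area = area;
                continue;
        }
        /* Step 2.3: area starts below this free block; try the previous one */
        if (base + start < va->va_start) {
                va = node_to_va(rb_prev(&va->rb_node));
                base = pvm_determine_end_from_reverse(&va, align) - end;
                term_area = area;
                continue;
        }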

    Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
    Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
    Signed-off-by: Kuppuswamy Sathyanarayanan
    Reported-by: Dave Hansen
    Acked-by: Dennis Zhou
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Roman Gushchin
    Cc: sathyanarayanan kuppuswamy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuppuswamy Sathyanarayanan
     
  • This patch is sent to report a use after free in mem_cgroup_iter()
    after merging commit be2657752e9e ("mm: memcg: fix use after free in
    mem_cgroup_iter()").

    I work with the Android kernel trees (4.9 & 4.14), and commit
    be2657752e9e ("mm: memcg: fix use after free in mem_cgroup_iter()") has
    been merged into those trees. However, I can still observe the use
    after free issues addressed in commit be2657752e9e (on low-end devices,
    a few times this month).

    backtrace:
        css_tryget <- use after free here

    To debug, the memcg was poisoned before being freed:

    static void __mem_cgroup_free(struct mem_cgroup *memcg)
    {
        ...
    +   /* poison memcg before freeing it */
    +   memset(memcg, 0x78, sizeof(struct mem_cgroup));
        kfree(memcg);
    }

    The coredump shows the position=0xdbbc2a00 is freed.

    (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
    $13 = {position = 0xdbbc2a00, generation = 0x2efd}

    0xdbbc2a00: 0xdbbc2e00 0x00000000 0xdbbc2800 0x00000100
    0xdbbc2a10: 0x00000200 0x78787878 0x00026218 0x00000000
    0xdbbc2a20: 0xdcad6000 0x00000001 0x78787800 0x00000000
    0xdbbc2a30: 0x78780000 0x00000000 0x0068fb84 0x78787878
    0xdbbc2a40: 0x78787878 0x78787878 0x78787878 0xe3fa5cc0
    0xdbbc2a50: 0x78787878 0x78787878 0x00000000 0x00000000
    0xdbbc2a60: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a70: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a80: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a90: 0x00000001 0x00000000 0x00000000 0x00100000
    0xdbbc2aa0: 0x00000001 0xdbbc2ac8 0x00000000 0x00000000
    0xdbbc2ab0: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2ac0: 0x00000000 0x00000000 0xe5b02618 0x00001000
    0xdbbc2ad0: 0x00000000 0x78787878 0x78787878 0x78787878
    0xdbbc2ae0: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2af0: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b00: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b10: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b20: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b30: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b40: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b50: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b60: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b70: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b80: 0x78787878 0x78787878 0x00000000 0x78787878
    0xdbbc2b90: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2ba0: 0x78787878 0x78787878 0x78787878 0x78787878

    In the reclaim path, try_to_free_pages() does not setup
    sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
    shrink_node().

    In mem_cgroup_iter(), root is set to root_mem_cgroup because
    sc->target_mem_cgroup is NULL. It is possible to assign a memcg to
    root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().

    try_to_free_pages
        struct scan_control sc = {...}, target_mem_cgroup is 0x0;
        do_try_to_free_pages
            shrink_zones
                shrink_node
                    mem_cgroup *root = sc->target_mem_cgroup;
                    memcg = mem_cgroup_iter(root, NULL, &reclaim);
                    mem_cgroup_iter()
                        if (!root)
                            root = root_mem_cgroup;
                        ...
                        css = css_next_descendant_pre(css, &root->css);
                        memcg = mem_cgroup_from_css(css);
                        cmpxchg(&iter->position, pos, memcg);

    My device uses memcg non-hierarchical mode. When we release a memcg:
    invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
    If non-hierarchical mode is used, invalidate_reclaim_iterators() never
    reaches root_mem_cgroup.

    static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
    {
        struct mem_cgroup *memcg = dead_memcg;

        for (; memcg; memcg = parent_mem_cgroup(memcg))
            ...
    }

    So the use after free scenario looks like:

    CPU1                                    CPU2

    try_to_free_pages
        do_try_to_free_pages
            shrink_zones
                shrink_node
                    mem_cgroup_iter()
                        if (!root)
                            root = root_mem_cgroup;
                        ...
                        css = css_next_descendant_pre(css, &root->css);
                        memcg = mem_cgroup_from_css(css);
                        cmpxchg(&iter->position, pos, memcg);

                                            invalidate_reclaim_iterators(memcg);
                                            ...
                                            __mem_cgroup_free()
                                                kfree(memcg);

    try_to_free_pages
        do_try_to_free_pages
            shrink_zones
                shrink_node
                    mem_cgroup_iter()
                        if (!root)
                            root = root_mem_cgroup;
                        ...
                        mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
                        iter = &mz->iter[reclaim->priority];
                        pos = READ_ONCE(iter->position);
                        css_tryget(&pos->css)
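
    The direction this report points to (a hedged sketch, not necessarily
    the exact upstream fix; __invalidate_reclaim_iterators() is an assumed
    helper that clears one memcg's per-node iterators) is to also
    invalidate root_mem_cgroup's iterators when the parent walk stops short
    of it:

        static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
        {
                struct mem_cgroup *memcg = dead_memcg;
                struct mem_cgroup *last;

                do {
                        __invalidate_reclaim_iterators(memcg, dead_memcg);
                        last = memcg;
                } while ((memcg = parent_mem_cgroup(memcg)));

                /*
                 * In cgroup1 non-hierarchical mode the walk above never
                 * reaches root_mem_cgroup, so handle it explicitly.
                 */
                if (last != root_mem_cgroup)
                        __invalidate_reclaim_iterators(root_mem_cgroup, dead_memcg);
        }
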
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
  • The constraint from the zpool use of z3fold_destroy_pool() is there are
    no outstanding handles to memory (so no active allocations), but it is
    possible for there to be outstanding work on either of the two wqs in
    the pool.

    Calling z3fold_deregister_migration() before the workqueues are drained
    means that there can be allocated pages referencing a freed inode,
    causing any thread in compaction to be able to trip over the bad pointer
    in PageMovable().

    Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com
    Fixes: 1f862989b04a ("mm/z3fold.c: support page migration")
    Signed-off-by: Henry Burns
    Reviewed-by: Shakeel Butt
    Reviewed-by: Jonathan Adams
    Cc: Vitaly Vul
    Cc: Vitaly Wool
    Cc: David Howells
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Henry Burns
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henry Burns
     
  • The constraint from the zpool use of z3fold_destroy_pool() is there are
    no outstanding handles to memory (so no active allocations), but it is
    possible for there to be outstanding work on either of the two wqs in
    the pool.

    If there is work queued on pool->compact_workqueue when it is called,
    z3fold_destroy_pool() will do:

    z3fold_destroy_pool()
        destroy_workqueue(pool->release_wq)
        destroy_workqueue(pool->compact_wq)
            drain_workqueue(pool->compact_wq)
                do_compact_page(zhdr)
                    kref_put(&zhdr->refcount)
                        __release_z3fold_page(zhdr, ...)
                            queue_work_on(pool->release_wq, &pool->work) *BOOM*

    So compact_wq needs to be destroyed before release_wq.
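
    A hedged sketch of the resulting teardown order in z3fold_destroy_pool()
    (names from mm/z3fold.c as referenced above; not the literal diff), which
    also covers the migration deregistration issue from the previous entry:

        static void z3fold_destroy_pool(struct z3fold_pool *pool)
        {
                kmem_cache_destroy(pool->c_handle);

                /*
                 * Pending work on compact_wq may queue work on release_wq, so
                 * compact_wq must go first; only after both are drained is it
                 * safe to unregister migration and drop the backing inode.
                 */
                destroy_workqueue(pool->compact_wq);
                destroy_workqueue(pool->release_wq);
                z3fold_unregister_migration(pool);
                kfree(pool);
        }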

    Link: http://lkml.kernel.org/r/20190726224810.79660-1-henryburns@google.com
    Fixes: 5d03a6613957 ("mm/z3fold.c: use kref to prevent page free/compact race")
    Signed-off-by: Henry Burns
    Reviewed-by: Shakeel Butt
    Reviewed-by: Jonathan Adams
    Cc: Vitaly Vul
    Cc: Vitaly Wool
    Cc: David Howells
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henry Burns
     
  • When running syzkaller internally, we ran into the below bug on 4.9.x
    kernel:

    kernel BUG at mm/huge_memory.c:2124!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
    task: ffff880067b34900 task.stack: ffff880068998000
    RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
    Call Trace:
    split_huge_page include/linux/huge_mm.h:100 [inline]
    queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:90 [inline]
    walk_pgd_range mm/pagewalk.c:116 [inline]
    __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
    walk_page_range+0x154/0x370 mm/pagewalk.c:285
    queue_pages_range+0x115/0x150 mm/mempolicy.c:694
    do_mbind mm/mempolicy.c:1241 [inline]
    SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
    SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
    do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
    entry_SYSCALL_64_after_swapgs+0x5d/0xdb
    Code: c7 80 1c 02 00 e8 26 0a 76 01 0b 48 c7 c7 40 46 45 84 e8 4c
    RIP [] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
    RSP

    with the below test:

    uint64_t r[1] = {0xffffffffffffffff};

    int main(void)
    {
            syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
            intptr_t res = 0;
            res = syscall(__NR_socket, 0x11, 3, 0x300);
            if (res != -1)
                    r[0] = res;
            *(uint32_t*)0x20000040 = 0x10000;
            *(uint32_t*)0x20000044 = 1;
            *(uint32_t*)0x20000048 = 0xc520;
            *(uint32_t*)0x2000004c = 1;
            syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
            syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
            *(uint64_t*)0x20000340 = 2;
            syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
            return 0;
    }

    Actually the test does:

    mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
    socket(AF_PACKET, SOCK_RAW, 768) = 3
    setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
    mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
    mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0

    The setsockopt() would allocate compound pages (16 pages in this test)
    for packet tx ring, then the mmap() would call packet_mmap() to map the
    pages into the user address space specified by the mmap() call.

    When calling mbind(), it would scan the vma to queue the pages for
    migration to the new node. It would split any huge page since 4.9
    doesn't support THP migration; however, the packet tx ring compound
    pages are not THP and are not even movable. So, the above bug is
    triggered.

    However, the later kernel is not hit by this issue due to commit
    d44d363f6578 ("mm: don't assume anonymous pages have SwapBacked flag"),
    which just removes the PageSwapBacked check for a different reason.

    But, there is a deeper issue. According to the semantics of mbind(), it
    should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
    MPOL_MF_STRICT was also specified, but the kernel was unable to move all
    existing pages in the range. The tx ring of the packet socket is
    definitely not movable; however, mbind() returns success for this case.

    Although most socket files are associated with non-movable pages, XDP
    may have movable pages from gup, so it is not enough to just check the
    underlying file type of the vma in vma_migratable().

    Change migrate_page_add() to check whether the page is movable or not;
    if it is unmovable, just return -EIO. But do not abort the pte walk
    immediately, since there may be pages that are only off the LRU
    temporarily. We should still migrate other pages if MPOL_MF_MOVE* is
    specified. Set the has_unmovable flag if some pages could not be moved,
    then return -EIO from mbind() eventually.

    With this change the above test would return -EIO as expected.
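
    A hedged sketch of the shape of migrate_page_add() after the change
    (isolation accounting elided; not the literal upstream diff):

        static int migrate_page_add(struct page *page, struct list_head *pagelist,
                                    unsigned long flags)
        {
                struct page *head = compound_head(page);

                /* Avoid migrating a page that is shared with others. */
                if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
                        if (!isolate_lru_page(head)) {
                                list_add_tail(&head->lru, pagelist);
                                /* ... NR_ISOLATED_* accounting ... */
                        } else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
                                /*
                                 * Unmovable or temporarily off-LRU page: report
                                 * it; the caller keeps walking and records
                                 * has_unmovable.
                                 */
                                return -EIO;
                        }
                }

                return 0;
        }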

    [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
    Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When both MPOL_MF_MOVE* and MPOL_MF_STRICT are specified, mbind() should
    try its best to migrate misplaced pages; if some of the pages could not
    be migrated, it should return -EIO.

    There are three different sub-cases:
    1. vma is not migratable
    2. vma is migratable, but there are unmovable pages
    3. vma is migratable, pages are movable, but migrate_pages() fails

    If #1 happens, the kernel would just abort immediately and return -EIO,
    after a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
    MPOL_MF_STRICT is specified").

    If #3 happens, the kernel would set the policy and migrate pages with
    best effort, but it won't roll back the migrated pages or reset the
    policy.

    Before that commit, they behaved in the same way. It would be better to
    keep their behavior consistent, but rolling back the migrated pages and
    resetting the policy does not seem feasible, so just make #1 behave the
    same as #3.

    Userspace will know that not everything was successfully migrated (via
    -EIO), and can take whatever steps it deems necessary - attempt
    rollback, determine which exact page(s) are violating the policy, etc.

    Make queue_pages_range() return 1 to indicate that there are unmovable
    pages or the vma is not migratable.

    Case #2 is not handled correctly in the current kernel; the following
    patch will fix it.
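
    A hedged sketch of how the caller can consume the new return value in
    do_mbind() (simplified control flow, not the literal diff):

        ret = queue_pages_range(mm, start, end, nmask,
                                flags | MPOL_MF_INVERT, &pagelist);
        if (ret < 0)
                return ret;             /* hard error from the page walk */
        ...
        /* unmovable pages (ret > 0) or failed migrations make mbind() fail */
        if (ret > 0 || (nr_failed && (flags & MPOL_MF_STRICT)))
                err = -EIO;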

    [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
    Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When migrating an anonymous private page to a ZONE_DEVICE private page,
    the source page->mapping and page->index fields are copied to the
    destination ZONE_DEVICE struct page and the page_mapcount() is
    increased. This is so rmap_walk() can be used to unmap and migrate the
    page back to system memory.

    However, try_to_unmap_one() computes the subpage pointer from a swap pte
    which computes an invalid page pointer and a kernel panic results such
    as:

    BUG: unable to handle page fault for address: ffffea1fffffffc8

    Currently, only single pages can be migrated to device private memory so
    no subpage computation is needed and it can be set to "page".
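
    A hedged sketch of the fix in try_to_unmap_one() (not the literal diff):

        if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
            is_zone_device_page(page)) {
                /*
                 * Device private pages are only ever migrated one base page
                 * at a time, so there is no THP subpage to derive from the
                 * swap pte; the subpage is simply the page itself.
                 */
                subpage = page;
                ...
        }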

    [rcampbell@nvidia.com: add comment]
    Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
    Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
    Signed-off-by: Ralph Campbell
    Cc: "Jérôme Glisse"
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Ira Weiny
    Cc: Jan Kara
    Cc: Lai Jiangshan
    Cc: Logan Gunthorpe
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • When a ZONE_DEVICE private page is freed, the page->mapping field can be
    set. If this page is reused as an anonymous page, the previous value
    can prevent the page from being inserted into the CPU's anon rmap table.
    For example, when migrating a pte_none() page to device memory:

    migrate_vma(ops, vma, start, end, src, dst, private)
        migrate_vma_collect()
            src[] = MIGRATE_PFN_MIGRATE
        migrate_vma_prepare()
            /* no page to lock or isolate so OK */
        migrate_vma_unmap()
            /* no page to unmap so OK */
        ops->alloc_and_copy()
            /* driver allocates ZONE_DEVICE page for dst[] */
        migrate_vma_pages()
            migrate_vma_insert_page()
                page_add_new_anon_rmap()
                    __page_set_anon_rmap()
                        /* This check sees the page's stale mapping field */
                        if (PageAnon(page))
                            return
        /* page->mapping is not updated */

    The result is that the migration appears to succeed but a subsequent CPU
    fault will be unable to migrate the page back to system memory or worse.

    Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
    pointer data doesn't affect future page use.
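
    A hedged sketch of the fix in the device page free path (the exact
    location is the devmap put path in mm/memremap.c; illustrative):

        /*
         * When a device private page is freed, page->mapping may still hold
         * a stale anon_vma pointer from its previous use. Clear it so the
         * next user of the page starts from a clean state.
         */
        page->mapping = NULL;
        page->pgmap->ops->page_free(page);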

    Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
    Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free")
    Signed-off-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: "Jérôme Glisse"
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

10 Aug, 2019

1 commit

  • Currently, attempts to shutdown and re-enable a device-dax instance
    trigger:

    Missing reference count teardown definition
    WARNING: CPU: 37 PID: 1608 at mm/memremap.c:211 devm_memremap_pages+0x234/0x850
    [..]
    RIP: 0010:devm_memremap_pages+0x234/0x850
    [..]
    Call Trace:
    dev_dax_probe+0x66/0x190 [device_dax]
    really_probe+0xef/0x390
    driver_probe_device+0xb4/0x100
    device_driver_attach+0x4f/0x60

    Given that the setup path initializes pgmap->ref, arrange for it to be
    also torn down so devm_memremap_pages() is ready to be called again and
    not be mistaken for the 3rd-party per-cpu-ref case.
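
    A hedged sketch of the teardown (field names assumed from the 5.3-era
    struct dev_pagemap; not the literal diff):

        static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
        {
                ...
                /*
                 * For the internally managed refcount, undo the ref
                 * assignment so a later devm_memremap_pages() call does not
                 * mistake this pgmap for the caller-managed case.
                 */
                if (pgmap->ref == &pgmap->internal_ref)
                        pgmap->ref = NULL;
        }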

    Fixes: 24917f6b1041 ("memremap: provide an optional internal refcount in struct dev_pagemap")
    Reported-by: Fan Du
    Tested-by: Vishal Verma
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/156530042781.2068700.8733813683117819799.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

03 Aug, 2019

7 commits

  • memremap.c implements MM functionality for ZONE_DEVICE, so it really
    should be in the mm/ directory, not the kernel/ one.

    Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Anshuman Khandual
    Acked-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • A return statement is unneeded in a void function.

    Link: http://lkml.kernel.org/r/20190723130814.21826-1-houweitaoo@gmail.com
    Signed-off-by: Weitao Hou
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weitao Hou
     
  • When CONFIG_MIGRATE_VMA_HELPER is enabled, migrate_vma() calls
    migrate_vma_collect() which initializes a struct mm_walk but didn't
    initialize mm_walk.pud_entry. (Found by code inspection) Use a C
    structure initialization to make sure it is set to NULL.
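
    A hedged sketch of the fix (designated initializers zero every other
    member, including pud_entry):

        struct mm_walk mm_walk = {
                .pmd_entry      = migrate_vma_collect_pmd,
                .pte_hole       = migrate_vma_collect_hole,
                .vma            = migrate->vma,
                .mm             = migrate->vma->vm_mm,
                .private        = migrate,
        };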

    Link: http://lkml.kernel.org/r/20190719233225.12243-1-rcampbell@nvidia.com
    Fixes: 8763cb45ab967 ("mm/migrate: new memory migration helper for use with device memory")
    Signed-off-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Andrew Morton
    Cc: "Jérôme Glisse"
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • "howaboutsynergy" reported via kernel buzilla number 204165 that
    compact_zone_order was consuming 100% CPU during a stress test for
    prolonged periods of time. Specifically the following command, which
    should exit in 10 seconds, was taking an excessive time to finish while
    the CPU was pegged at 100%.

    stress -m 220 --vm-bytes 1000000000 --timeout 10

    Tracing indicated a pattern as follows

    stress-3923 [007] 519.106208: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106212: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106216: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106219: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106223: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106227: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106231: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106235: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106238: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106242: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0

    Note that compaction is entered in rapid succession while scanning and
    isolating nothing. The problem is that when a task that is compacting
    receives a fatal signal, it retries indefinitely while making no
    progress, instead of exiting.

    It's not easy to trigger this condition although enabling zswap helps on
    the basis that the timing is altered. A very small window has to be hit
    for the problem to occur (signal delivered while compacting and
    isolating a PFN for migration that is not aligned to SWAP_CLUSTER_MAX).

    This was reproduced locally -- 16G single socket system, 8G swap, 30%
    zswap configured, vm-bytes 22000000000, using Colin King's stress-ng
    implementation from github running in a loop until the problem hits.
    Tracing recorded the problem occurring almost 200K times in a short
    window. With this patch, the problem hit 4 times but the task exited
    normally instead of consuming CPU.

    This problem has existed for some time but it was made worse by commit
    cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as
    contention"). Before that commit, if the same condition was hit then
    locks would be quickly contended and compaction would exit that way.
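
    A hedged sketch of the kind of check the fix adds in the migration
    scanner (placement and naming follow mm/compaction.c loosely; not the
    literal diff):

        /* checked periodically inside isolate_migratepages_block() */
        if (fatal_signal_pending(current)) {
                cc->contended = true;
                low_pfn = 0;            /* isolate nothing ... */
                goto fatal_pending;     /* ... and abort instead of retrying forever */
        }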

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204165
    Link: http://lkml.kernel.org/r/20190718085708.GE24383@techsingularity.net
    Fixes: cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as contention")
    Signed-off-by: Mel Gorman
    Reviewed-by: Vlastimil Babka
    Cc: [5.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • buffer_migrate_page_norefs() can race with bh users in the following
    way:

    CPU1                                    CPU2
    buffer_migrate_page_norefs()
        buffer_migrate_lock_buffers()
        checks bh refs
        spin_unlock(&mapping->private_lock)
                                            __find_get_block()
                                                spin_lock(&mapping->private_lock)
                                                grab bh ref
                                                spin_unlock(&mapping->private_lock)
        move page
                                            do bh work

    This can result in various issues like lost updates to buffers (i.e.
    metadata corruption) or use after free issues for the old page.

    This patch closes the race by holding mapping->private_lock while the
    mapping is being moved to a new page. Ordinarily, a reference can be
    taken outside of the private_lock using the per-cpu BH LRU but the
    references are checked and the LRU invalidated if necessary. The
    private_lock is held once the references are known so the buffer lookup
    slow path will spin on the private_lock. Between the page lock and
    private_lock, it should be impossible for other references to be
    acquired and updates to happen during the migration.

    A user had reported data corruption issues on a distribution kernel with
    a similar page migration implementation as mainline. The data
    corruption could not be reproduced with this patch applied. A small
    number of migration-intensive tests were run and no performance problems
    were noted.

    [mgorman@techsingularity.net: Changelog, removed tracing]
    Link: http://lkml.kernel.org/r/20190718090238.GF24383@techsingularity.net
    Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()"
    Signed-off-by: Jan Kara
    Signed-off-by: Mel Gorman
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Shakeel Butt reported a premature oom on a kernel with
    "cgroup_disable=memory" since mem_cgroup_is_root() returns false even
    though the memcg is actually NULL. drop_caches is also broken.

    It is because commit aeed1d325d42 ("mm/vmscan.c: generalize
    shrink_slab() calls in shrink_node()") removed the !memcg check before
    !mem_cgroup_is_root(). And, surprisingly, the root memcg is allocated
    even though the memory cgroup is disabled by the kernel boot parameter.

    Add mem_cgroup_disabled() check to make reclaimer work as expected.
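
    A hedged sketch of the added guard in shrink_slab() (not the literal
    diff):

        /*
         * With "cgroup_disable=memory" the root memcg is still allocated, so
         * mem_cgroup_is_root() alone is not enough to detect the no-memcg
         * case; check mem_cgroup_disabled() first.
         */
        if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
                return shrink_slab_memcg(gfp_mask, nid, memcg, priority);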

    Link: http://lkml.kernel.org/r/1563385526-20805-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: aeed1d325d42 ("mm/vmscan.c: generalize shrink_slab() calls in shrink_node()")
    Signed-off-by: Yang Shi
    Reported-by: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Jan Hadrava
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Hugh Dickins
    Cc: Qian Cai
    Cc: Kirill A. Shutemov
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When running ltp's oom test with kmemleak enabled, the below warning
    was triggered since the kernel detects that __GFP_NOFAIL &
    ~__GFP_DIRECT_RECLAIM is passed in:

    WARNING: CPU: 105 PID: 2138 at mm/page_alloc.c:4608 __alloc_pages_nodemask+0x1c31/0x1d50
    Modules linked in: loop dax_pmem dax_pmem_core ip_tables x_tables xfs virtio_net net_failover virtio_blk failover ata_generic virtio_pci virtio_ring virtio libata
    CPU: 105 PID: 2138 Comm: oom01 Not tainted 5.2.0-next-20190710+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:__alloc_pages_nodemask+0x1c31/0x1d50
    ...
    kmemleak_alloc+0x4e/0xb0
    kmem_cache_alloc+0x2a7/0x3e0
    mempool_alloc_slab+0x2d/0x40
    mempool_alloc+0x118/0x2b0
    bio_alloc_bioset+0x19d/0x350
    get_swap_bio+0x80/0x230
    __swap_writepage+0x5ff/0xb20

    The mempool_alloc_slab() clears __GFP_DIRECT_RECLAIM, however kmemleak
    has __GFP_NOFAIL set all the time due to d9570ee3bd1d4f2 ("kmemleak:
    allow to coexist with fault injection"). But, it doesn't make any sense
    to have __GFP_NOFAIL and ~__GFP_DIRECT_RECLAIM specified at the same
    time.

    According to the discussion on the mailing list, the commit should be
    reverted as a short term solution. Catalin Marinas will follow up with
    a better solution for the longer term.

    The failure rate of kmemleak metadata allocation may increase in some
    circumstances, but this is an expected side effect.

    Link: http://lkml.kernel.org/r/1563299431-111710-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: d9570ee3bd1d4f2 ("kmemleak: allow to coexist with fault injection")
    Signed-off-by: Yang Shi
    Suggested-by: Catalin Marinas
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Cc: David Rientjes
    Cc: Matthew Wilcox
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

01 Aug, 2019

1 commit

  • To properly clear the slab on free with slab_want_init_on_free, we walk
    the list of free objects using get_freepointer/set_freepointer.

    The value we get from get_freepointer may not be valid. This isn't an
    issue since an actual value will get written later but this means
    there's a chance of triggering a bug if we use this value with
    set_freepointer:

    kernel BUG at mm/slub.c:306!
    invalid opcode: 0000 [#1] PREEMPT PTI
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-05754-g6471384a #4
    RIP: 0010:kfree+0x58a/0x5c0
    Code: 48 83 05 78 37 51 02 01 0f 0b 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 d6 37 51 02 01 0b 48 83 05 d4 37 51 02 01 48 83 05 d4 37 51 02 01 48 83 05 d4
    RSP: 0000:ffffffff82603d90 EFLAGS: 00010002
    RAX: ffff8c3976c04320 RBX: ffff8c3976c04300 RCX: 0000000000000000
    RDX: ffff8c3976c04300 RSI: 0000000000000000 RDI: ffff8c3976c04320
    RBP: ffffffff82603db8 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff8c3976c04320 R11: ffffffff8289e1e0 R12: ffffd52cc8db0100
    R13: ffff8c3976c01a00 R14: ffffffff810f10d4 R15: ffff8c3976c04300
    FS: 0000000000000000(0000) GS:ffffffff8266b000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8c397ffff000 CR3: 0000000125020000 CR4: 00000000000406b0
    Call Trace:
    apply_wqattrs_prepare+0x154/0x280
    apply_workqueue_attrs_locked+0x4e/0xe0
    apply_workqueue_attrs+0x36/0x60
    alloc_workqueue+0x25a/0x6d0
    workqueue_init_early+0x246/0x348
    start_kernel+0x3c7/0x7ec
    x86_64_start_reservations+0x40/0x49
    x86_64_start_kernel+0xda/0xe4
    secondary_startup_64+0xb6/0xc0
    Modules linked in:
    ---[ end trace f67eb9af4d8d492b ]---

    Fix this by ensuring the value we set with set_freepointer is either NULL
    or another value in the chain.
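
    A hedged sketch of the idea in the slub free hook (redzone/metadata
    handling and the head/tail update elided; not the literal diff): rebuild
    the freelist while walking it, so set_freepointer() only ever stores NULL
    or a pointer to another object already in the chain.

        void *object, *prev = NULL, *next = *head;
        void *old_tail = *tail ? *tail : *head;

        do {
                object = next;
                next = get_freepointer(s, object);
                memset(object, 0, s->object_size);   /* init_on_free clearing */
                set_freepointer(s, object, prev);    /* never a stale value */
                prev = object;
        } while (object != old_tail);
        /* the chain is now reversed; head/tail are updated accordingly (elided) */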

    Reported-by: kernel test robot
    Signed-off-by: Laura Abbott
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Reviewed-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

31 Jul, 2019

1 commit

  • Pull HMM fixes from Jason Gunthorpe:
    "Fix the locking around nouveau's use of the hmm_range_* APIs. It works
    correctly in the success case, but many of the edge cases have
    missing unlocks or double unlocks.

    The diffstat is a bit big as Christoph did a comprehensive job to move
    the obsolete API from the core header and into the driver before
    fixing its flow, but the risk of regression from this code motion is
    low"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    nouveau: unlock mmap_sem on all errors from nouveau_range_fault
    nouveau: remove the block parameter to nouveau_range_fault
    mm/hmm: move hmm_vma_range_done and hmm_vma_fault to nouveau
    mm/hmm: always return EBUSY for invalid ranges in hmm_range_{fault,snapshot}

    Linus Torvalds
     

30 Jul, 2019

1 commit

  • Pull virtio/vhost fixes from Michael Tsirkin:

    - Fixes in the iommu and balloon devices.

    - Disable the meta-data optimization for now - I hope we can get it
    fixed shortly, but there's no point in making users suffer crashes
    while we are working on that.

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    vhost: disable metadata prefetch optimization
    iommu/virtio: Update to most recent specification
    balloon: fix up comments
    mm/balloon_compaction: avoid duplicate page removal

    Linus Torvalds
     

26 Jul, 2019

1 commit

  • We should not have two different error codes for the same
    condition. EAGAIN must be reserved for the FAULT_FLAG_ALLOW_RETRY retry
    case and signals to the caller that the mmap_sem has been unlocked.

    Use EBUSY for the !valid case so that callers can get the locking right.
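
    A hedged sketch of what the distinction buys a caller (the helper below
    and its name are hypothetical driver code, not from nouveau or this
    patch; only the -EAGAIN / -EBUSY meanings come from the changelog):

    #include <linux/errno.h>
    #include <linux/mm_types.h>
    #include <linux/rwsem.h>

    /* Decide how to retry after a failed hmm_range_* call. */
    static long drv_handle_range_status(struct mm_struct *mm, long ret)
    {
            switch (ret) {
            case -EAGAIN:
                    /*
                     * FAULT_FLAG_ALLOW_RETRY case: the fault path has
                     * already dropped mmap_sem for us, so retry without
                     * calling up_read().
                     */
                    return -EAGAIN;
            case -EBUSY:
                    /*
                     * !valid range: mmap_sem is still held, so drop it
                     * here before backing off and retrying.
                     */
                    up_read(&mm->mmap_sem);
                    return -EBUSY;
            default:
                    return ret;
            }
    }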

    Link: https://lore.kernel.org/r/20190724065258.16603-2-hch@lst.de
    Tested-by: Ralph Campbell
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Felix Kuehling
    [jgg: elaborated commit message]
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

22 Jul, 2019

3 commits

  • Lots of comments bitrotted. Fix them up.

    Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces)
    Reviewed-by: Wei Wang
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Ralph Campbell
    Acked-by: Nadav Amit

    Michael S. Tsirkin
     
  • A #GP is reported in the guest when requesting balloon inflation via
    virtio-balloon. The reason is that the virtio-balloon driver has
    removed the page from its internal page list (via balloon_page_pop),
    but balloon_page_enqueue_one also calls "list_del" to do the removal.
    This is necessary when it's used from balloon_page_enqueue_list, but
    not from balloon_page_enqueue.

    Move the list_del into balloon_page_enqueue_list, and update comments
    accordingly.
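
    A hedged sketch of the shape of that fix, using the changelog's
    function names (the locking and accounting details of
    mm/balloon_compaction.c are simplified, so treat this as illustrative
    rather than the upstream diff):

    #include <linux/balloon_compaction.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    /* Per-page helper: no longer unlinks the page itself. */
    static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
                                         struct page *page)
    {
            /*
             * No list_del(&page->lru) here: the single-page caller
             * (balloon_page_enqueue) hands in a page the driver has
             * already detached via balloon_page_pop().
             */
            balloon_page_insert(b_dev_info, page);
    }

    /* The list variant is the one that must detach each page first. */
    static size_t balloon_page_enqueue_list(struct balloon_dev_info *b_dev_info,
                                            struct list_head *pages)
    {
            struct page *page, *tmp;
            unsigned long flags;
            size_t n_pages = 0;

            spin_lock_irqsave(&b_dev_info->pages_lock, flags);
            list_for_each_entry_safe(page, tmp, pages, lru) {
                    list_del(&page->lru);   /* moved here by the fix */
                    balloon_page_enqueue_one(b_dev_info, page);
                    n_pages++;
            }
            spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);

            return n_pages;
    }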

    Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces)
    Signed-off-by: Wei Wang
    Signed-off-by: Michael S. Tsirkin

    Wei Wang
     
  • On x86-32 with PTI enabled, parts of the kernel page-tables are not shared
    between processes. This can cause mappings in the vmalloc/ioremap area to
    persist in some page-tables after the region is unmapped and released.

    When the region is re-used, processes that still carry the old mappings
    do not fault in the new mappings but keep accessing the old ones.

    This causes undefined behavior; in practice it often means data
    corruption, kernel oopses, panics and even spontaneous reboots.

    Fix this problem by actively syncing unmaps in the vmalloc/ioremap area
    to all page-tables in the system before the regions can be re-used.
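
    Roughly, the approach hooks a global page-table sync into the point
    where lazily-unmapped vmalloc space would become reusable. A minimal
    sketch, assuming the pre-5.6 vmalloc_sync_all() interface as the
    synchronisation primitive (the helper name and the exact hook placement
    are illustrative, not the literal patch):

    #include <linux/vmalloc.h>

    static void sync_unmaps_before_reuse(void)
    {
            /*
             * Propagate the kernel page-table changes (including the
             * unmap) into every page-table in the system ...
             */
            vmalloc_sync_all();

            /* ... only then may the address range be handed out again. */
    }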

    References: https://bugzilla.suse.com/show_bug.cgi?id=1118689
    Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F')
    Signed-off-by: Joerg Roedel
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Dave Hansen
    Link: https://lkml.kernel.org/r/20190719184652.11391-4-joro@8bytes.org

    Joerg Roedel
     

20 Jul, 2019

2 commits

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     
  • Merge yet more updates from Andrew Morton:
    "The rest of MM and a kernel-wide procfs cleanup.

    Summary of the more significant patches:

    - Patch series "mm/memory_hotplug: Factor out memory block device
    handling", v3. David Hildenbrand.

    Some spring-cleaning of the memory hotplug code, notably in
    drivers/base/memory.c

    - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
    Shi.

    Fix /proc/pid/smaps output for THP pages used in shmem.

    - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.

    Bugfix and speedup for kernel/resource.c

    - Patch series "mm: Further memory block device cleanups", David
    Hildenbrand.

    More spring-cleaning of the memory hotplug code.

    - Patch series "mm: Sub-section memory hotplug support". Dan
    Williams.

    Generalise the memory hotplug code so that pmem can use it more
    completely. Then remove the hacks from the libnvdimm code which
    were there to work around the memory-hotplug code's constraints.

    - "proc/sysctl: add shared variables for range check", Matteo Croce.

    We have about 250 instances of

    int zero;
    ...
    .extra1 = &zero,

    in the tree. This is a tree-wide sweep to make all those private
    "zero"s and "one"s use global variables.

    Alas, it isn't practical to make those two global integers const"
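
    As a concrete illustration of that sysctl conversion, here is a hedged
    before/after sketch (the table entry is generic, not a specific in-tree
    one; SYSCTL_ZERO and SYSCTL_ONE are the shared pointers the series
    introduces):

    #include <linux/sysctl.h>

    static int demo_value;

    /* Before: every file carried its own private bounds variables. */
    static int zero;
    static int one = 1;

    static struct ctl_table demo_table_old[] = {
            {
                    .procname       = "demo_value",
                    .data           = &demo_value,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec_minmax,
                    .extra1         = &zero,
                    .extra2         = &one,
            },
            { }
    };

    /* After: point at the shared, tree-wide constants instead. */
    static struct ctl_table demo_table_new[] = {
            {
                    .procname       = "demo_value",
                    .data           = &demo_value,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec_minmax,
                    .extra1         = SYSCTL_ZERO,
                    .extra2         = SYSCTL_ONE,
            },
            { }
    };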

    * emailed patches from Andrew Morton : (38 commits)
    proc/sysctl: add shared variables for range check
    mm: migrate: remove unused mode argument
    mm/sparsemem: cleanup 'section number' data types
    libnvdimm/pfn: stop padding pmem namespaces to section alignment
    libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
    mm/devm_memremap_pages: enable sub-section remap
    mm: document ZONE_DEVICE memory-model implications
    mm/sparsemem: support sub-section hotplug
    mm/sparsemem: prepare for sub-section ranges
    mm: kill is_dev_zone() helper
    mm/hotplug: kill is_dev_zone() usage in __remove_pages()
    mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
    mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
    mm/sparsemem: add helpers track active portions of a section at boot
    mm/sparsemem: introduce a SECTION_IS_EARLY flag
    mm/sparsemem: introduce struct mem_section_usage
    drivers/base/memory.c: get rid of find_memory_block_hinted()
    mm/memory_hotplug: move and simplify walk_memory_blocks()
    mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
    mm: make register_mem_sect_under_node() static
    ...

    Linus Torvalds
     

19 Jul, 2019

8 commits

  • migrate_page_move_mapping() doesn't use the mode argument. Remove it
    and update callers accordingly.

    Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
    Signed-off-by: Keith Busch
    Reviewed-by: Zi Yan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keith Busch
     
  • David points out that there is a mixture of 'int' and 'unsigned long'
    usage for section number data types. Update the memory hotplug path to
    use 'unsigned long' consistently for section numbers.
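
    A one-line illustration of the convention being settled on (the wrapper
    and 'start_pfn' are hypothetical; pfn_to_section_nr() is the existing
    conversion helper):

    #include <linux/mmzone.h>

    /* Section numbers are derived from pfns (unsigned long), so carry them
     * as 'unsigned long' throughout instead of truncating into an 'int'. */
    static unsigned long demo_section_of(unsigned long start_pfn)
    {
            return pfn_to_section_nr(start_pfn);
    }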

    [akpm@linux-foundation.org: fix printk format]
    Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: David Hildenbrand
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The libnvdimm sub-system has suffered a series of hacks and broken
    workarounds for the memory-hotplug implementation's awkward
    section-aligned (128MB) granularity.

    For example the following backtrace is emitted when attempting
    arch_add_memory() with physical address ranges that intersect 'System
    RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
    2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
    devm_memremap_pages+0x460/0x6e0
    pmem_attach_disk+0x29e/0x680 [nd_pmem]
    ? nd_dax_probe+0xfc/0x120 [libnvdimm]
    nvdimm_bus_probe+0x66/0x160 [libnvdimm]

    It was discovered that the problem goes beyond RAM vs PMEM collisions,
    as some platforms produce PMEM vs PMEM collisions within a given
    section.
    The libnvdimm workaround for that case revealed that the libnvdimm
    section-alignment-padding implementation has been broken for a long
    while.

    A fix for that long-standing breakage introduces as many problems as it
    solves, since it would require a backward-incompatible change to the
    namespace metadata interpretation. Instead of that dubious route [1],
    address the root problem in the memory-hotplug implementation.

    Note that EEXIST is no longer treated as success, as that is how
    sparse_add_section() reports subsection collisions. The special case was
    also obviated by recent changes that perform the request_region() for
    'System RAM' before arch_add_memory() in the add_memory() sequence.

    [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

    [osalvador@suse.de: fix deactivate_section for early sections]
    Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare the memory hot-{add,remove} paths for handling sub-section
    ranges by plumbing the starting page frame and number of pages being
    handled through arch_{add,remove}_memory() to
    sparse_{add,remove}_one_section().

    This is simply plumbing, small cleanups, and some identifier renames.
    No intended functional changes.

    Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Given there are no more usages of is_dev_zone() outside of 'ifdef
    CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.

    Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Wei Yang
    Acked-by: David Hildenbrand
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The zone type check was a leftover from the cleanup that plumbed altmap
    through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
    vmem_altmap to arch_remove_memory and __remove_pages".

    Link: http://lkml.kernel.org/r/156092352642.979959.6664333788149363039.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Allow sub-section sized ranges to be added to the memmap.

    populate_section_memmap() takes an explicit pfn range rather than
    assuming a full section, and those parameters are plumbed all the way
    through to vmemmap_populate(). There should be no sub-section usage in
    current deployments. New warnings are added to clarify which memmap
    allocation paths are sub-section capable.

    Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Logan Gunthorpe
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Sub-section hotplug support reduces the unit of operation of hotplug
    from section-sized-units (PAGES_PER_SECTION) to sub-section-sized units
    (PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider
    PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not
    valid_section(), can toggle.
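
    A hedged sketch of the scan pattern this implies (the wrapper function
    is illustrative; PAGES_PER_SUBSECTION and pfn_valid() are the symbols
    named above):

    #include <linux/mmzone.h>

    /*
     * Walk a span in sub-section steps and ask pfn_valid() at each step,
     * since the presence of memory can now change at sub-section
     * granularity rather than only per valid_section().
     */
    static bool span_has_memory(unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long pfn;

            for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SUBSECTION)
                    if (pfn_valid(pfn))
                            return true;

            return false;
    }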

    [osalvador@suse.de: fix shrink_{zone,node}_span]
    Link: http://lkml.kernel.org/r/20190717090725.23618-3-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092351496.979959.12703722803097017492.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams