31 Aug, 2019

6 commits

  • Instead of using raw_cpu_read(), use per_cpu() to read the actual data of
    the corresponding cpu; otherwise we will be reading the data of the
    current cpu once for each online CPU.
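
    A minimal sketch of the difference, using a hypothetical percpu counter
    (demo_counter and demo_sum are not part of the memcg code):
    raw_cpu_read() always reads the copy of the CPU the code is currently
    running on, while per_cpu() reads the copy of the CPU passed in, which
    is what a for_each_online_cpu() loop needs.

        DEFINE_PER_CPU(long, demo_counter);

        static long demo_sum(void)
        {
                long sum = 0;
                int cpu;

                for_each_online_cpu(cpu) {
                        /* buggy: adds the current CPU's value once per online CPU */
                        /* sum += raw_cpu_read(demo_counter); */

                        /* fixed: reads the copy belonging to 'cpu' */
                        sum += per_cpu(demo_counter, cpu);
                }
                return sum;
        }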

    Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com
    Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
    Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
    Signed-off-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Adric Blake has noticed[1] the following warning:

    WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
    [...]
    Call Trace:
    mem_cgroup_shrink_node+0x9b/0x1d0
    mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
    balance_pgdat+0x276/0x540
    kswapd+0x200/0x3f0
    ? wait_woken+0x80/0x80
    kthread+0xfd/0x130
    ? balance_pgdat+0x540/0x540
    ? kthread_park+0x80/0x80
    ret_from_fork+0x35/0x40
    ---[ end trace 727343df67b2398a ]---

    which tells us that soft limit reclaim is about to overwrite the
    reclaim_state configured up in the call chain (kswapd in this case, but
    direct reclaim is equally possible). This means that reclaim stats
    would become misleading once soft reclaim returns and another reclaim
    is done.

    Fix the warning by dropping set_task_reclaim_state from the soft reclaim
    path, which is always entered with reclaim_state already set up.
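
    A sketch of the shape of the fix (simplified pseudo-diff, not the exact
    patch; the elided body stands for the actual reclaim work): the soft
    limit reclaim entry point stops installing its own reclaim_state and
    relies on the one its caller already set up.

        unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, ...)
        {
                struct scan_control sc = { ... };

        -       set_task_reclaim_state(current, &sc.reclaim_state);
                ...     /* do the actual reclaim */
        -       set_task_reclaim_state(current, NULL);

                return sc.nr_reclaimed;
        }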

    [1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com

    Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Adric Blake
    Acked-by: Yafang Shao
    Acked-by: Yang Shi
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Fix lock/unlock imbalance by unlocking *zhdr* before return.

    Addresses Coverity ID 1452811 ("Missing unlock")

    Link: http://lkml.kernel.org/r/20190826030634.GA4379@embeddedor
    Fixes: d776aaa9895e ("mm/z3fold.c: fix race between migration and destruction")
    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Andrew Morton
    Cc: Henry Burns
    Cc: Vitaly Wool
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • Partially revert "mm/memcontrol.c: keep local VM counters in sync with
    the hierarchical ones"

    Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
    with the hierarchical ones") effectively decreased the precision of
    per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.

    That's good for displaying in memory.stat, but brings a serious
    regression into the reclaim process.

    One issue I've discovered and debugged is the following:
    lruvec_lru_size() can return 0 instead of the actual number of pages in
    the lru list, preventing the kernel from reclaiming the last remaining
    pages. The result is yet more dying memory cgroups flooding the system.
    The opposite also happens: scanning an empty lru list is a waste of cpu
    time.

    Also, inactive_list_is_low() can return incorrect values, preventing the
    active lru from being scanned and freed. It can fail both because the
    sizes of the active and inactive lists are inaccurate, and because the
    number of workingset refaults isn't precise. In other words, the result
    is pretty random.

    I'm not sure if using the approximate number of slab pages in
    count_shadow_nodes() is acceptable, but the issues described above are
    enough to partially revert the patch.

    Let's keep per-memcg vmstats_local batched (it is only used for
    displaying stats to userspace), but keep lruvec stats precise. This
    change fixes the dead memcg flooding on my setup.
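
    A sketch of the batched-vs-precise split described above (hypothetical
    variable names such as memcg_pcpu, memcg_vmstats_local and lruvec_stat;
    not the exact kernel code): the local memcg counter keeps the percpu
    threshold, while the lruvec counter is folded into its atomic value on
    every update.

        /* batched: only spills once the percpu delta exceeds the threshold */
        long x = val + __this_cpu_read(memcg_pcpu->stat[idx]);

        if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
                atomic_long_add(x, &memcg_vmstats_local[idx]);
                x = 0;
        }
        __this_cpu_write(memcg_pcpu->stat[idx], x);

        /* precise: every update reaches the atomic counter immediately */
        atomic_long_add(val, &lruvec_stat[idx]);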

    Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com
    Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Acked-by: Yafang Shao <laoar.shao@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Roman Gushchin
     
  • Fixes: 701d678599d0c1 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
    Link: http://lkml.kernel.org/r/201908251039.5oSbEEUT%25lkp@intel.com
    Reported-by: kbuild test robot
    Cc: Sergey Senozhatsky
    Cc: Henry Burns
    Cc: Minchan Kim
    Cc: Shakeel Butt
    Cc: Jonathan Adams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • I've noticed that the "slab" value in memory.stat is sometimes 0, even
    if some children memory cgroups have a non-zero "slab" value. The
    following investigation showed that this is the result of the kmem_cache
    reparenting in combination with the per-cpu batching of slab vmstats.

    At offlining, some vmstat values may remain in the percpu cache without
    being propagated up the cgroup hierarchy. It means that stats on
    ancestor levels are lower than the actual values. Later, when slab pages
    are released, the precise number of pages is subtracted on the parent
    level, making the value negative. We don't show negative values; 0 is
    printed instead.

    To fix this issue, let's flush percpu slab memcg and lruvec stats on
    memcg offlining. This guarantees that numbers on all ancestor levels
    are accurate and match the actual number of outstanding slab pages.

    Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com
    Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
    Signed-off-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

25 Aug, 2019

8 commits

  • The code like this:

    ptr = kmalloc(size, GFP_KERNEL);
    page = virt_to_page(ptr);
    offset = offset_in_page(ptr);
    kfree(page_address(page) + offset);

    may produce false-positive invalid-free reports on the kernel with
    CONFIG_KASAN_SW_TAGS=y.

    In the example above we lose the original tag assigned to 'ptr', so
    kfree() gets a pointer with the 0xFF tag. In kfree() we see that the
    0xFF tag is different from the tag in shadow memory and hence print a
    false report.

    Instead of just comparing tags, do the following:

    1) Check that shadow doesn't contain KASAN_TAG_INVALID. Otherwise it's a
    double-free and it doesn't matter what tag the pointer has.

    2) If the pointer tag is different from 0xFF, make sure that the tag in
    the shadow is the same as in the pointer, as sketched below.
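
    A sketch of that check order (hypothetical helper, not the kernel's
    kasan code; KASAN_TAG_KERNEL is the 0xFF "match-all" tag):

        static bool demo_free_tag_ok(u8 ptr_tag, u8 shadow_tag)
        {
                if (shadow_tag == KASAN_TAG_INVALID)
                        return false;                   /* double-free, tag irrelevant */

                if (ptr_tag != KASAN_TAG_KERNEL)        /* pointer still carries a tag */
                        return ptr_tag == shadow_tag;

                return true;                            /* 0xFF: original tag was lost */
        }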

    Link: http://lkml.kernel.org/r/20190819172540.19581-1-aryabinin@virtuozzo.com
    Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode")
    Signed-off-by: Andrey Ryabinin
    Reported-by: Walter Wu
    Reported-by: Mark Rutland
    Reviewed-by: Andrey Konovalov
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • In zs_destroy_pool() we call flush_work(&pool->free_work). However, we
    have no guarantee that migration isn't happening in the background at
    that time.

    Since migration can't directly free pages, it relies on free_work being
    scheduled to free the pages. But there's nothing preventing an
    in-progress migration from queuing the work *after*
    zs_unregister_migration() has called flush_work(), which would leave
    pages still pointing at the inode when we free it.

    Since we know at destroy time all objects should be free, no new
    migrations can come in (since zs_page_isolate() fails for fully-free
    zspages). This means it is sufficient to track a "# isolated zspages"
    count by class, and have the destroy logic ensure all such pages have
    drained before proceeding. Keeping that state under the class spinlock
    keeps the logic straightforward.
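
    A sketch of that scheme (hypothetical field and helper names such as
    'isolated' and all_isolated_drained(); simplified): the isolation count
    is adjusted under the class lock, and destruction waits for it to drain
    before the migration inode goes away.

        /* zs_page_isolate() / putback paths */
        spin_lock(&class->lock);
        class->isolated++;              /* decremented again on putback/free */
        spin_unlock(&class->lock);

        /* zs_unregister_migration(), before freeing the inode */
        wait_event(pool->migration_wait, all_isolated_drained(pool));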

    In this case a memory leak could lead to an eventual crash if compaction
    hits the leaked page. This crash would only occur if people are
    changing their zswap backend at runtime (which eventually starts
    destruction).

    Link: http://lkml.kernel.org/r/20190809181751.219326-2-henryburns@google.com
    Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
    Signed-off-by: Henry Burns
    Reviewed-by: Sergey Senozhatsky
    Cc: Henry Burns
    Cc: Minchan Kim
    Cc: Shakeel Butt
    Cc: Jonathan Adams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henry Burns
     
  • In zs_page_migrate() we call putback_zspage() after we have finished
    migrating all pages in this zspage. However, the return value is
    ignored. If a zs_free() races in between zs_page_isolate() and
    zs_page_migrate(), freeing the last object in the zspage,
    putback_zspage() will leave the page in ZS_EMPTY for potentially an
    unbounded amount of time.

    To fix this, we need to do the same thing as zs_page_putback() does:
    schedule free_work to occur.

    To avoid duplicated code, move the sequence to a new
    putback_zspage_deferred() function which both zs_page_migrate() and
    zs_page_putback() call.
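
    A sketch of what such a helper can look like (simplified; ZS_EMPTY is
    the fullness group of a zspage with no live objects):

        static void putback_zspage_deferred(struct zs_pool *pool,
                                            struct size_class *class,
                                            struct zspage *zspage)
        {
                enum fullness_group fg = putback_zspage(class, zspage);

                if (fg == ZS_EMPTY)
                        schedule_work(&pool->free_work);
        }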

    Link: http://lkml.kernel.org/r/20190809181751.219326-1-henryburns@google.com
    Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
    Signed-off-by: Henry Burns
    Reviewed-by: Sergey Senozhatsky
    Cc: Henry Burns
    Cc: Minchan Kim
    Cc: Shakeel Butt
    Cc: Jonathan Adams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henry Burns
     
  • THP splitting path is missing the split_page_owner() call that
    split_page() has.

    As a result, split THP pages are wrongly reported in the page_owner file
    as order-9 pages. Furthermore when the former head page is freed, the
    remaining former tail pages are not listed in the page_owner file at
    all. This patch fixes that by adding the split_page_owner() call into
    __split_huge_page().
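
    A sketch of where the call lands (heavily trimmed excerpt, shown only
    to illustrate the placement):

        static void __split_huge_page(struct page *page, ...)
        {
                struct page *head = compound_head(page);
                ...
                ClearPageCompound(head);
                split_page_owner(head, HPAGE_PMD_ORDER);        /* the missing call */
                ...
        }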

    Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz
    Fixes: a9627bc5e34e ("mm/page_owner: introduce split_page_owner and replace manual handling")
    Reported-by: Kirill A. Shutemov
    Signed-off-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Similar to vmstats, percpu caching of local vmevents leads to an
    accumulation of errors on non-leaf levels. This happens because some
    leftovers may remain in percpu caches, so that they are never propagated
    up the cgroup tree and simply disappear when the memory cgroup is
    released.

    To fix this issue let's accumulate and propagate percpu vmevents values
    before releasing the memory cgroup similar to what we're doing with
    vmstats.

    Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
    only over online cpus.

    Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
    Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Percpu caching of local vmstats with the conditional propagation by the
    cgroup tree leads to an accumulation of errors on non-leaf levels.

    Let's imagine two nested memory cgroups A and A/B. Say, a process
    belonging to A/B allocates 100 pagecache pages on the CPU 0. The percpu
    cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B
    and A atomic vmstat counters, 4 pages will remain in the percpu cache.

    Imagine A/B is near its memory.max limit, so that every following allocation
    triggers a direct reclaim on the local CPU. Say, each such attempt will
    free 16 pages on a new cpu. That means every percpu cache will have -16
    pages, except the first one, which will have 4 - 16 = -12. A/B and A
    atomic counters will not be touched at all.

    Now a user removes A/B. All percpu caches are freed and corresponding
    vmstat numbers are forgotten. A has 96 pages more than expected.

    As memory cgroups are created and destroyed, errors do accumulate. Even
    1-2 pages differences can accumulate into large numbers.

    To fix this issue let's accumulate and propagate percpu vmstat values
    before releasing the memory cgroup. At this point these numbers are
    stable and cannot be changed.

    Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
    only over online cpus.
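
    A sketch of such a flush (simplified; the real struct layout carries
    both global and local counters, and the demo_* helper name is made up):
    fold each online CPU's cached deltas into the atomic counters before
    the memcg goes away.

        static void demo_flush_percpu_vmstats(struct mem_cgroup *memcg)
        {
                long stat[MEMCG_NR_STAT] = { 0, };
                int cpu, i;

                for_each_online_cpu(cpu)
                        for (i = 0; i < MEMCG_NR_STAT; i++)
                                stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);

                for (i = 0; i < MEMCG_NR_STAT; i++)
                        atomic_long_add(stat[i], &memcg->vmstats[i]);
        }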

    Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
    Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • After commit 907ec5fca3dc ("mm: zero remaining unavailable struct
    pages"), struct page of reserved memory is zeroed. This causes
    page->flags to be 0 and fixes issues related to reading
    /proc/kpageflags, for example, of reserved memory.

    The VM_BUG_ON() in move_freepages_block(), however, assumes that
    page_zone() is meaningful even for reserved memory. That assumption is
    no longer true after the aforementioned commit.

    There's no reason why move_freepages_block() should be testing the
    legitimacy of page_zone() for reserved memory; its scope is limited only
    to pages on the zone's freelist.

    Note that pfn_valid() can be true for reserved memory: there is a
    backing struct page. The check for page_to_nid(page) is also buggy but
    reserved memory normally only appears on node 0 so the zeroing doesn't
    affect this.

    Move the debug checks to after verifying PageBuddy is true. This limits
    the scope of the checks to buddy pages on the zone's freelist that
    move_freepages_block() is operating on. In this case, an incorrect node
    or zone is a bug worthy of being warned about (and the examination of
    struct page is acceptable because this memory is not reserved).
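
    A sketch of the reordering (simplified loop body, not the complete
    function):

        for (page = start_page; page <= end_page;) {
                if (!pfn_valid_within(page_to_pfn(page))) {
                        page++;
                        continue;
                }
                if (!PageBuddy(page)) {
                        page++;
                        continue;
                }

                /* the debug checks now only run for pages on the freelist */
                VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
                VM_BUG_ON_PAGE(page_zone(page) != zone, page);
                ...
        }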

    Why does move_freepages_block() get called on reserved memory? It's
    simply math after finding a valid free page from the per-zone free area
    to use as fallback. We find the beginning and end of the pageblock of
    the valid page and that can bring us into memory that was reserved per
    the e820. pfn_valid() is still true (it's backed by a struct page), but
    since it's zero'd we shouldn't make any inferences here about comparing
    its node or zone. The current node check just happens to succeed most
    of the time by luck because reserved memory typically appears on node 0.

    The fix here is to validate that we actually have buddy pages before
    testing if there's any type of zone or node strangeness going on.

    We noticed it almost immediately after bringing 907ec5fca3dc in on
    CONFIG_DEBUG_VM builds. It depends on finding specific free pages in
    the per-zone free area where the math in move_freepages() will bring the
    start or end pfn into reserved memory and wanting to claim that entire
    pageblock as a new migratetype. So the path will be rare, require
    CONFIG_DEBUG_VM, and require fallback to a different migratetype.

    Some struct pages were already zeroed for reserved pages before
    907ec5fca3dc, so this theoretically could trigger before that commit. I
    think it's rare enough, and under a config option that most people don't
    run, that others may not have noticed. I wouldn't argue against a stable
    tag and the backport should be easy enough, but I probably wouldn't
    single out a commit that this is fixing.

    Mel said:

    : The overhead of the debugging check is higher with this patch although
    : it'll only affect debug builds and the path is not particularly hot.
    : If this was a concern, I think it would be reasonable to simply remove
    : the debugging check as the zone boundaries are checked in
    : move_freepages_block and we never expect a zone/node to be smaller than
    : a pageblock and stuck in the middle of another zone.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Masayoshi Mizuma
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq).
    However, we have no guarantee that migration isn't happening in the
    background at that time.

    Migration directly calls queue_work_on(pool->compact_wq), if destruction
    wins that race we are using a destroyed workqueue.

    Link: http://lkml.kernel.org/r/20190809213828.202833-1-henryburns@google.com
    Signed-off-by: Henry Burns
    Cc: Vitaly Wool
    Cc: Shakeel Butt
    Cc: Jonathan Adams
    Cc: Henry Burns
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henry Burns
     

14 Aug, 2019

15 commits

  • Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
    the kernel-v5.2.3 testing. This is caused by a race between hugetlb
    page migration and page fault.

    If a hugetlb page can not be allocated to satisfy a page fault, the task
    is sent SIGBUS. This is normal hugetlbfs behavior. A hugetlb fault
    mutex exists to prevent two tasks from trying to instantiate the same
    page. This protects against the situation where there is only one
    hugetlb page, and both tasks would try to allocate. Without the mutex,
    one would fail and SIGBUS even though the other fault would be
    successful.

    There is a similar race between hugetlb page migration and fault.
    Migration code will allocate a page for the target of the migration. It
    will then unmap the original page from all page tables. It does this
    unmap by first clearing the pte and then writing a migration entry. The
    page table lock is held for the duration of this clear and write
    operation. However, the beginning of the hugetlb page fault code
    optimistically checks the pte without taking the page table lock. If
    the pte is clear (as it can be during the migration unmap operation), a
    hugetlb page allocation is attempted to satisfy the fault. Note that the page
    which will eventually satisfy this fault was already allocated by the
    migration code. However, the allocation within the fault path could
    fail which would result in the task incorrectly being sent SIGBUS.

    Ideally, we could take the hugetlb fault mutex in the migration code
    when modifying the page tables. However, locks must be taken in the
    order of hugetlb fault mutex, page lock, page table lock. This would
    require significant rework of the migration code. Instead, the issue is
    addressed in the hugetlb fault code. After failing to allocate a huge
    page, take the page table lock and check for huge_pte_none before
    returning an error. This is the same check that must be made further in
    the code even if page allocation is successful.
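
    A sketch of that recheck (simplified excerpt; error handling trimmed):

        page = alloc_huge_page(vma, haddr, 0);
        if (IS_ERR(page)) {
                /*
                 * A parallel migration may already have instantiated the
                 * page: recheck the pte under the lock before reporting
                 * the allocation failure as SIGBUS.
                 */
                ptl = huge_pte_lock(h, mm, ptep);
                if (!huge_pte_none(huge_ptep_get(ptep))) {
                        ret = 0;
                        spin_unlock(ptl);
                        goto out;
                }
                spin_unlock(ptl);
                ret = vmf_error(PTR_ERR(page));
                goto out;
        }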

    Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
    Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
    Signed-off-by: Mike Kravetz
    Reported-by: Li Wang
    Tested-by: Li Wang
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Cyril Hrubis
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
    ("mm: reclaim small amounts of memory when an external fragmentation
    event occurs").

    The report is extensive:

    https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/

    and it's worth recording the most relevant parts (colorful language and
    typos included).

    When running a simple, steady state 4kB file creation test to
    simulate extracting tarballs larger than memory full of small
    files into the filesystem, I noticed that once memory fills up
    the cache balance goes to hell.

    The workload is creating one dirty cached inode for every dirty
    page, both of which should require a single IO each to clean and
    reclaim, and creation of inodes is throttled by the rate at which
    dirty writeback runs at (via balance dirty pages). Hence the ingest
    rate of new cached inodes and page cache pages is identical and
    steady. As a result, memory reclaim should quickly find a steady
    balance between page cache and inode caches.

    The moment memory fills, the page cache is reclaimed at a much
    faster rate than the inode cache, and evidence suggests that
    the inode cache shrinker is not being called when large batches
    of pages are being reclaimed. In roughly the same time period
    that it takes to fill memory with 50% pages and 50% slab caches,
    memory reclaim reduces the page cache down to just dirty pages
    and slab caches fill the entirety of memory.

    The LRU is largely full of dirty pages, and we're getting spikes
    of random writeback from memory reclaim so it's all going to shit.
    Behaviour never recovers, the page cache remains pinned at just
    dirty pages, and nothing I could tune would make any difference.
    vfs_cache_pressure makes no difference - I would set it so high
    it should trim the entire inode caches in a single pass, yet it
    didn't do anything. It was clear from tracing and live telemetry
    that the shrinkers were pretty much not running except when
    there was absolutely no memory free at all, and then they did
    the minimum necessary to free memory to make progress.

    So I went looking at the code, trying to find places where pages
    got reclaimed and the shrinkers weren't called. There's only one
    - kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
    reclaim small amounts of memory when an external fragmentation
    event occurs").

    The watermark boosting introduced by the commit is triggered in response
    to an allocation "fragmentation event". The boosting was not intended
    to target THP specifically and triggers even if THP is disabled.
    However, with Dave's perfectly reasonable workload, fragmentation events
    can be very common given the ratio of slab to page cache allocations so
    boosting remains active for long periods of time.

    As high-order allocations might use compaction and compaction cannot
    move slab pages the decision was made in the commit to special-case
    kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
    reclaiming slab does not directly help compaction.

    As Dave notes, this decision means that slab can be artificially
    protected for long periods of time and messes up the balance between
    slab and page caches.

    Removing the special casing can still indirectly help avoid
    fragmentation by avoiding fragmentation-causing events due to slab
    allocation as pages from a slab pageblock will have some slab objects
    freed. Furthermore, with the special casing, reclaim behaviour is
    unpredictable as kswapd sometimes examines slab and sometimes does not
    in a manner that is tricky to tune or analyse.

    This patch removes the special casing. The downside is that this is not
    a universal performance win. Some benchmarks that depend on the
    residency of data when rereading metadata may see a regression when slab
    reclaim is restored to its original behaviour. Similarly, some
    benchmarks that only read-once or write-once may perform better when
    page reclaim is too aggressive. The primary upside is that slab
    shrinker is less surprising (arguably more sane but that's a matter of
    opinion), behaves consistently regardless of the fragmentation state of
    the system and properly obeys VM sysctls.

    A fsmark benchmark configuration was constructed similar to what Dave
    reported and is codified by the mmtest configuration
    config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
    machine to avoid dealing with NUMA-related issues and the timing of
    reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS
    filesystem was used for the test data.

    This is not an exact replication of Dave's setup. The configuration
    scales its parameters depending on the memory size of the SUT to behave
    similarly across machines. The parameters mean the first sample
    reported by fs_mark is using 50% of RAM which will barely be throttled
    and look like a big outlier. Dave used fake NUMA to have multiple
    kswapd instances which I didn't replicate. Finally, the number of
    iterations differ from Dave's test as the target disk was not large
    enough. While not identical, it should be representative.

    fsmark
                               5.3.0-rc3              5.3.0-rc3
                                 vanilla          shrinker-v1r1
    Min        1-files/sec    4444.80 (  0.00%)    4765.60 (   7.22%)
    1st-qrtle  1-files/sec    5005.10 (  0.00%)    5091.70 (   1.73%)
    2nd-qrtle  1-files/sec    4917.80 (  0.00%)    4855.60 (  -1.26%)
    3rd-qrtle  1-files/sec    4667.40 (  0.00%)    4831.20 (   3.51%)
    Max-1      1-files/sec   11421.50 (  0.00%)    9999.30 ( -12.45%)
    Max-5      1-files/sec   11421.50 (  0.00%)    9999.30 ( -12.45%)
    Max-10     1-files/sec   11421.50 (  0.00%)    9999.30 ( -12.45%)
    Max-90     1-files/sec    4649.60 (  0.00%)    4780.70 (   2.82%)
    Max-95     1-files/sec    4491.00 (  0.00%)    4768.20 (   6.17%)
    Max-99     1-files/sec    4491.00 (  0.00%)    4768.20 (   6.17%)
    Max        1-files/sec   11421.50 (  0.00%)    9999.30 ( -12.45%)
    Hmean      1-files/sec    5004.75 (  0.00%)    5075.96 (   1.42%)
    Stddev     1-files/sec    1778.70 (  0.00%)    1369.66 (  23.00%)
    CoeffVar   1-files/sec      33.70 (  0.00%)      26.05 (  22.71%)
    BHmean-99  1-files/sec    5053.72 (  0.00%)    5101.52 (   0.95%)
    BHmean-95  1-files/sec    5053.72 (  0.00%)    5101.52 (   0.95%)
    BHmean-90  1-files/sec    5107.05 (  0.00%)    5131.41 (   0.48%)
    BHmean-75  1-files/sec    5208.45 (  0.00%)    5206.68 (  -0.03%)
    BHmean-50  1-files/sec    5405.53 (  0.00%)    5381.62 (  -0.44%)
    BHmean-25  1-files/sec    6179.75 (  0.00%)    6095.14 (  -1.37%)

                          5.3.0-rc3     5.3.0-rc3
                            vanilla     shrinker-v1r1
    Duration User            501.82        497.29
    Duration System         4401.44       4424.08
    Duration Elapsed        8124.76       8358.05

    This is showing a slight skew for the max result, which represents a
    large outlier; the 1st, 2nd and 3rd quartiles are similar, indicating
    that the bulk of the results show little difference. Note that an
    earlier version of the fsmark configuration showed a regression, but
    that included more samples taken while memory was still filling.

    Note that the elapsed time is higher. Part of this is that the
    configuration included time to delete all the test files when the test
    completes -- the test automation handles the possibility of testing
    fsmark with multiple thread counts. Without the patch, many of these
    objects would be memory resident which is part of what the patch is
    addressing.

    There are other important observations that justify the patch.

    1. With the vanilla kernel, the number of dirty pages in the system is
    very low for much of the test. With this patch, dirty pages is
    generally kept at 10% which matches vm.dirty_background_ratio which
    is normal expected historical behaviour.

    2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
    0.95 for much of the test i.e. Slab is being left alone and
    dominating memory consumption. With the patch applied, the ratio
    varies between 0.35 and 0.45 with the bulk of the measured ratios
    roughly half way between those values. This is a different balance to
    what Dave reported but it was at least consistent.

    3. Slabs are scanned throughout the entire test with the patch applied.
    The vanilla kernel has periods with no scan activity and then
    relatively massive spikes.

    4. Without the patch, kswapd scan rates are very variable. With the
    patch, the scan rates remain quite steady.

    5. Overall vmstats are closer to normal expectations

                                      5.3.0-rc3      5.3.0-rc3
                                        vanilla  shrinker-v1r1
    Ops Direct pages scanned         99388.00      328410.00
    Ops Kswapd pages scanned      45382917.00    33451026.00
    Ops Kswapd pages reclaimed    30869570.00    25239655.00
    Ops Direct pages reclaimed       74131.00        5830.00
    Ops Kswapd efficiency %             68.02          75.45
    Ops Kswapd velocity               5585.75        4002.25
    Ops Page reclaim immediate     1179721.00      430927.00
    Ops Slabs scanned             62367361.00    73581394.00
    Ops Direct inode steals           2103.00        1002.00
    Ops Kswapd inode steals         570180.00     5183206.00

    o Vanilla kernel is hitting direct reclaim more frequently,
    not very much in absolute terms but the fact the patch
    reduces it is interesting
    o "Page reclaim immediate" in the vanilla kernel indicates
    dirty pages are being encountered at the tail of the LRU.
    This is generally bad and means in this case that the LRU
    is not long enough for dirty pages to be cleaned by the
    background flush in time. This is much reduced by the
    patch.
    o With the patch, kswapd is reclaiming 10 times more slab
    pages than with the vanilla kernel. This is indicative
    of the watermark boosting over-protecting slab

    A more complete set of tests were run that were part of the basis for
    introducing boosting and while there are some differences, they are well
    within tolerances.

    Bottom line, special-casing kswapd to avoid reclaiming slab makes
    behaviour unpredictable and can lead to abnormal results for normal
    workloads.

    This patch restores the expected behaviour that slab and page cache are
    balanced consistently for a workload with a steady allocation ratio of
    slab/pagecache pages. It also means that workloads which favour the
    preservation of slab over pagecache can tune that via
    vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
    the parameter when boosting is active.

    Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Mel Gorman
    Reviewed-by: Dave Chinner
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This reverts commit 2f0799a0ffc033b ("mm, thp: restore node-local
    hugepage allocations").

    Commit 2f0799a0ffc033b was rightfully applied to avoid the risk of a
    severe regression that was reported by the kernel test robot at the end
    of the merge window. Now we understand that the regression was a false
    positive, caused by a significant increase in fairness during a
    swap thrashing benchmark. So it's safe to re-apply the fix and continue
    improving the code from there. The benchmark that reported the
    regression is very useful, but it provides a meaningful result only when
    there is no significant alteration in fairness during the workload. The
    removal of __GFP_THISNODE increased fairness.

    __GFP_THISNODE cannot be used in the generic page faults path for new
    memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
    behavior significantly deviates from what the MPOL_DEFAULT semantics are
    supposed to be for THP and 4k allocations alike.

    Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
    set to "madvise") was never meant to provide an implicit MPOL_BIND on
    the "current" node the task is running on; it caused swap storms and
    provided much more aggressive behavior than even zone_reclaim_mode = 3.

    Any workload that could have benefited from __GFP_THISNODE now has to
    enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
    the zone_reclaim_mode behavior, but it only did so if THP was enabled:
    if THP was disabled, there would have been no chance to get any 4k page
    from the current node if the current node was full of pagecache, which
    further shows how __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
    MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
    semantics; in fact the two are orthogonal: zone_reclaim_mode = 1|2|3
    must work exactly the same with MADV_HUGEPAGE set or not.

    The performance characteristic of memory depends on the hardware
    details. The numbers below are obtained on Naples/EPYC architecture and
    the N/A projection extends them to show what we should aim for in the
    future as a good THP NUMA locality default. The benchmark used
    exercises random memory seeks (note: the cost of the page faults is not
    part of the measurement).

    D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
    0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A

    D0 means distance zero (i.e. local memory), D1 means distance one (i.e.
    intra socket memory), D2 means distance two (i.e. inter socket memory),
    etc...

    For the guest physical memory allocated by qemu and for guest mode
    kernel the performance characteristic of RAM is more complex and an
    ideal default could be:

    D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
    0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A

    NOTE: the N/A are projections and haven't been measured yet, the
    measurement in this case is done on a 1950x with only two NUMA nodes.
    The THP case here means THP was used both in the host and in the guest.

    After applying this commit the THP NUMA locality order that we'll get
    out of MADV_HUGEPAGE is this:

    D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

    Before this commit it was:

    D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

    Even if we ignore the breakage of large workloads that can't fit in a
    single node that the __GFP_THISNODE implicit "current node" mbind
    caused, the THP NUMA locality order provided by __GFP_THISNODE was still
    not the one we shall aim for in the long term (i.e. the first one at
    the top).

    After this commit is applied, we can introduce a new allocator multi
    order API and replace those two alloc_pages_vma calls in the page
    fault path with a single multi order call:

    unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
    page = alloc_pages_multi_order(..., &order);
    if (!page)
            goto out;
    if (!(order & (1 << 0))) {
            VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
            /* THP fault */
    } else {
            VM_WARN_ON(order != 1 << 0);
            /* 4k fallback */
    }

    The page allocator logic has to be altered so that when it fails on any
    zone with order 9, it tries again with order 0 before falling back to
    the next zone in the zonelist.

    After that we need to do more measurements and evaluate if adding an
    opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
    with "DN+1 THP | DN 4k" at every NUMA distance crossing.

    Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Zi Yan
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

    The fixes for what was originally reported as "pathological THP
    behavior" were rightfully reverted to be sure not to introduce
    regressions at the end of a merge window after a severe regression
    report from the kernel bot. We can safely re-apply them now that we
    have had time to analyze the problem.

    The mm process worked fine, because the good fixes were eventually
    committed upstream without excessive delay.

    The regression reported by the kernel bot however forced us to revert
    the good fixes to be sure not to introduce regressions and to give us
    the time to analyze the issue further. The silver lining is that this
    extra time allowed us to think more about this issue and also to plan a
    future direction for improving THP NUMA locality further.

    This patch (of 2):

    This reverts commit 356ff8a9a78fb35d ("Revert "mm, thp: consolidate THP
    gfp handling into alloc_hugepage_direct_gfpmask"). So it reapplies
    89c83fb539f954 ("mm, thp: consolidate THP gfp handling into
    alloc_hugepage_direct_gfpmask").

    Consolidating the THP allocation flags in one place was meant to be a
    cleanup to more easily handle otherwise scattered code which imposes a
    maintenance burden. There were no real problems observed with the gfp
    mask consolidation, but the reversion was rushed through without a
    larger consensus regardless.

    This patch brings the consolidation back because this should make the
    long term maintainability easier as well as it should allow future
    changes to be less error prone.

    [mhocko@kernel.org: changelog additions]
    Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Zi Yan
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Memcg counters for shadow nodes are broken because the memcg pointer is
    obtained in a wrong way. The following approach is used:
    virt_to_page(xa_node)->mem_cgroup

    Since commit 4d96ba353075 ("mm: memcg/slab: stop setting
    page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
    set for slab pages, so memcg_from_slab_page() should be used instead.

    Also I doubt that it ever worked correctly: virt_to_head_page() should
    be used instead of virt_to_page(), otherwise objects residing on tail
    pages are not accounted, because only the head page contains a valid
    mem_cgroup pointer. That has been the case since the introduction of
    these counters by commit 68d48e6a2df5 ("mm: workingset: add vmstat
    counter for shadow nodes").

    Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently, when checking to see if accessing n bytes starting at address
    "ptr" will cause a wraparound in the memory addresses, the check in
    check_bogus_address() adds an extra byte, which is incorrect, as the
    range of addresses that will be accessed is [ptr, ptr + (n - 1)].

    This can lead to incorrectly detecting a wraparound in the memory
    address when trying to read 4 KB from memory that is mapped to the
    last possible page in the virtual address space, when in fact accessing
    that range of memory would not cause a wraparound to occur.

    Use the memory range that will actually be accessed when considering if
    accessing a certain amount of bytes will cause the memory address to
    wrap around.
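
    A sketch of the corrected bound (simplified; assumes n > 0, which the
    callers already guarantee, and reuses the existing usercopy_abort()
    reporting):

        /* the last byte touched is ptr + n - 1, so that is what must not wrap */
        if ((unsigned long)ptr + n - 1 < (unsigned long)ptr)
                usercopy_abort("wrapped address", NULL, to_user, 0, n);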

    Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org
    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Signed-off-by: Prasad Sodagudi
    Signed-off-by: Isaac J. Manjarres
    Co-developed-by: Prasad Sodagudi
    Reviewed-by: William Kucharski
    Acked-by: Kees Cook
    Cc: Greg Kroah-Hartman
    Cc: Trilok Soni
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Isaac J. Manjarres
     
  • If an error occurs during kmemleak_init() (e.g. kmem cache cannot be
    created), kmemleak is disabled but kmemleak_early_log remains enabled.
    Subsequently, when the .init.text section is freed, the log_early()
    function no longer exists. To avoid a page fault in such a scenario,
    ensure that kmemleak_disable() also disables early logging.

    Link: http://lkml.kernel.org/r/20190731152302.42073-1-catalin.marinas@arm.com
    Signed-off-by: Catalin Marinas
    Reported-by: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Recent changes to the vmalloc code by commit 68ad4a330433
    ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
    cause spurious percpu allocation failures. These, in turn, can result
    in panic()s in the slub code. One such possible panic was reported by
    Dave Hansen at the following link: https://lkml.org/lkml/2019/6/19/939.
    Another related panic observed is:

    RIP: 0033:0x7f46f7441b9b
    Call Trace:
    dump_stack+0x61/0x80
    pcpu_alloc.cold.30+0x22/0x4f
    mem_cgroup_css_alloc+0x110/0x650
    cgroup_apply_control_enable+0x133/0x330
    cgroup_mkdir+0x41b/0x500
    kernfs_iop_mkdir+0x5a/0x90
    vfs_mkdir+0x102/0x1b0
    do_mkdirat+0x7d/0xf0
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The VMALLOC memory manager divides the entire VMALLOC space
    (VMALLOC_START to VMALLOC_END) into multiple VM areas (struct vm_areas),
    and it mainly uses two lists (vmap_area_list & free_vmap_area_list) to
    track the used and free VM areas in VMALLOC space. The
    pcpu_get_vm_areas(offsets[], sizes[], nr_vms, align) function is used
    for allocating congruent VM areas for the percpu memory allocator. In
    order not to conflict with VMALLOC users, pcpu_get_vm_areas allocates VM
    areas near the end of the VMALLOC space, so the search for a free
    vm_area for the given requirement starts near VMALLOC_END and moves
    upwards towards VMALLOC_START.

    Prior to commit 68ad4a330433, the search for free vm_area in
    pcpu_get_vm_areas() involves following two main steps.

    Step 1:
        Find an aligned "base" address near VMALLOC_END.
        va = free vm area near VMALLOC_END
    Step 2:
        Loop through the number of requested vm_areas and check:
        Step 2.1:
            if (base < VMALLOC_START)
                1. fail with error
        Step 2.2:
            // end is offsets[area] + sizes[area]
            if (base + end > va->vm_end)
                1. Move the base downwards and repeat Step 2
        Step 2.3:
            if (base + start < va->vm_start)
                1. Move to the previous free vm_area node, find an aligned
                   base address and repeat Step 2

    But commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

    Step 2.3:
        if (base + start < va->vm_start || base + end > va->vm_end)
            1. Move to the previous free vm_area node, find an aligned
               base address and repeat Step 2

    The above change is the root cause of spurious percpu memory allocation
    failures. For example, consider a case where a relatively large vm_area
    (~30 TB) was ignored in the free vm_area search because it did not pass
    the base + end < va->vm_end boundary check. Ignoring such large free
    vm_areas can lead to not finding a free vm_area within the
    VMALLOC_START to VMALLOC_END boundary, which in turn leads to
    allocation failures.

    So modify the search algorithm to include Step 2.2.
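
    A sketch of the restored check inside the placement loop of
    pcpu_get_vm_areas() (simplified; field and helper names follow the
    pseudocode above and the existing placement helper):

        if (base + end > va->vm_end) {          /* the check from Step 2.2 */
                /* this va cannot hold the area: shift base down and retry */
                base = pvm_determine_end_from_reverse(&va, align) - end;
                term_area = area;
                continue;
        }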

    Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
    Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
    Signed-off-by: Kuppuswamy Sathyanarayanan
    Reported-by: Dave Hansen
    Acked-by: Dennis Zhou
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Roman Gushchin
    Cc: sathyanarayanan kuppuswamy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuppuswamy Sathyanarayanan
     
  • This patch is sent to report a use after free in mem_cgroup_iter()
    after merging commit be2657752e9e ("mm: memcg: fix use after free in
    mem_cgroup_iter()").

    I work with the android kernel trees (4.9 & 4.14), and commit
    be2657752e9e ("mm: memcg: fix use after free in mem_cgroup_iter()") has
    been merged to those trees. However, I can still observe the use after
    free issues addressed in commit be2657752e9e (on low-end devices, a few
    times this month).

    backtrace:
    css_tryget
    [...]

    To debug, mem_cgroup was poisoned before being freed:

            ...
    +       /* poison memcg before freeing it */
    +       memset(memcg, 0x78, sizeof(struct mem_cgroup));
            kfree(memcg);
    }

    The coredump shows the position=0xdbbc2a00 is freed.

    (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
    $13 = {position = 0xdbbc2a00, generation = 0x2efd}

    0xdbbc2a00: 0xdbbc2e00 0x00000000 0xdbbc2800 0x00000100
    0xdbbc2a10: 0x00000200 0x78787878 0x00026218 0x00000000
    0xdbbc2a20: 0xdcad6000 0x00000001 0x78787800 0x00000000
    0xdbbc2a30: 0x78780000 0x00000000 0x0068fb84 0x78787878
    0xdbbc2a40: 0x78787878 0x78787878 0x78787878 0xe3fa5cc0
    0xdbbc2a50: 0x78787878 0x78787878 0x00000000 0x00000000
    0xdbbc2a60: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a70: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a80: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a90: 0x00000001 0x00000000 0x00000000 0x00100000
    0xdbbc2aa0: 0x00000001 0xdbbc2ac8 0x00000000 0x00000000
    0xdbbc2ab0: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2ac0: 0x00000000 0x00000000 0xe5b02618 0x00001000
    0xdbbc2ad0: 0x00000000 0x78787878 0x78787878 0x78787878
    0xdbbc2ae0: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2af0: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b00: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b10: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b20: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b30: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b40: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b50: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b60: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b70: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b80: 0x78787878 0x78787878 0x00000000 0x78787878
    0xdbbc2b90: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2ba0: 0x78787878 0x78787878 0x78787878 0x78787878

    In the reclaim path, try_to_free_pages() does not setup
    sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
    shrink_node().

    In mem_cgroup_iter(), root is set to root_mem_cgroup because
    sc->target_mem_cgroup is NULL. It is possible to assign a memcg to
    root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().

    try_to_free_pages
      struct scan_control sc = {...}, target_mem_cgroup is 0x0;
      do_try_to_free_pages
        shrink_zones
          shrink_node
            mem_cgroup *root = sc->target_mem_cgroup;
            memcg = mem_cgroup_iter(root, NULL, &reclaim);
            mem_cgroup_iter()
              if (!root)
                root = root_mem_cgroup;
              ...

              css = css_next_descendant_pre(css, &root->css);
              memcg = mem_cgroup_from_css(css);
              cmpxchg(&iter->position, pos, memcg);

    My device uses memcg non-hierarchical mode. When we release a memcg:
    invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
    If non-hierarchical mode is used, invalidate_reclaim_iterators() never
    reaches root_mem_cgroup.

    static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
    {
            struct mem_cgroup *memcg = dead_memcg;

            for (; memcg; memcg = parent_mem_cgroup(memcg))
                    ...
    }

    So the use after free scenario looks like:

    CPU1 CPU2

    try_to_free_pages
    do_try_to_free_pages
    shrink_zones
    shrink_node
    mem_cgroup_iter()
    if (!root)
    root = root_mem_cgroup;
    ...
    css = css_next_descendant_pre(css, &root->css);
    memcg = mem_cgroup_from_css(css);
    cmpxchg(&iter->position, pos, memcg);

    invalidate_reclaim_iterators(memcg);
    ...
    __mem_cgroup_free()
    kfree(memcg);

    try_to_free_pages
    do_try_to_free_pages
    shrink_zones
    shrink_node
    mem_cgroup_iter()
    if (!root)
    root = root_mem_cgroup;
    ...
    mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
    iter = &mz->iter[reclaim->priority];
    pos = READ_ONCE(iter->position);
    css_tryget(&pos->css)
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
  • The constraint from the zpool use of z3fold_destroy_pool() is there are
    no outstanding handles to memory (so no active allocations), but it is
    possible for there to be outstanding work on either of the two wqs in
    the pool.

    Calling z3fold_deregister_migration() before the workqueues are drained
    means that there can be allocated pages referencing a freed inode,
    causing any thread in compaction to be able to trip over the bad pointer
    in PageMovable().

    Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com
    Fixes: 1f862989b04a ("mm/z3fold.c: support page migration")
    Signed-off-by: Henry Burns
    Reviewed-by: Shakeel Butt
    Reviewed-by: Jonathan Adams
    Cc: Vitaly Vul
    Cc: Vitaly Wool
    Cc: David Howells
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Henry Burns
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henry Burns
     
  • The constraint from the zpool use of z3fold_destroy_pool() is there are
    no outstanding handles to memory (so no active allocations), but it is
    possible for there to be outstanding work on either of the two wqs in
    the pool.

    If there is work queued on pool->compact_workqueue when it is called,
    z3fold_destroy_pool() will do:

    z3fold_destroy_pool()
      destroy_workqueue(pool->release_wq)
      destroy_workqueue(pool->compact_wq)
        drain_workqueue(pool->compact_wq)
          do_compact_page(zhdr)
            kref_put(&zhdr->refcount)
              __release_z3fold_page(zhdr, ...)
                queue_work_on(pool->release_wq, &pool->work) *BOOM*

    So compact_wq needs to be destroyed before release_wq.
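
    A sketch of the corrected teardown order (simplified):

        static void z3fold_destroy_pool(struct z3fold_pool *pool)
        {
                /*
                 * Draining compact_wq can still queue work on release_wq,
                 * so compact_wq has to be destroyed first.
                 */
                destroy_workqueue(pool->compact_wq);
                destroy_workqueue(pool->release_wq);
                ...
        }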

    Link: http://lkml.kernel.org/r/20190726224810.79660-1-henryburns@google.com
    Fixes: 5d03a6613957 ("mm/z3fold.c: use kref to prevent page free/compact race")
    Signed-off-by: Henry Burns
    Reviewed-by: Shakeel Butt
    Reviewed-by: Jonathan Adams
    Cc: Vitaly Vul
    Cc: Vitaly Wool
    Cc: David Howells
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henry Burns
     
  • When running syzkaller internally, we ran into the below bug on 4.9.x
    kernel:

    kernel BUG at mm/huge_memory.c:2124!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
    task: ffff880067b34900 task.stack: ffff880068998000
    RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
    Call Trace:
    split_huge_page include/linux/huge_mm.h:100 [inline]
    queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:90 [inline]
    walk_pgd_range mm/pagewalk.c:116 [inline]
    __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
    walk_page_range+0x154/0x370 mm/pagewalk.c:285
    queue_pages_range+0x115/0x150 mm/mempolicy.c:694
    do_mbind mm/mempolicy.c:1241 [inline]
    SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
    SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
    do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
    entry_SYSCALL_64_after_swapgs+0x5d/0xdb
    Code: c7 80 1c 02 00 e8 26 0a 76 01 0b 48 c7 c7 40 46 45 84 e8 4c
    RIP [] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
    RSP

    with the below test:

    uint64_t r[1] = {0xffffffffffffffff};

    int main(void)
    {
            syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
            intptr_t res = 0;
            res = syscall(__NR_socket, 0x11, 3, 0x300);
            if (res != -1)
                    r[0] = res;
            *(uint32_t*)0x20000040 = 0x10000;
            *(uint32_t*)0x20000044 = 1;
            *(uint32_t*)0x20000048 = 0xc520;
            *(uint32_t*)0x2000004c = 1;
            syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
            syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
            *(uint64_t*)0x20000340 = 2;
            syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
            return 0;
    }

    Actually the test does:

    mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
    socket(AF_PACKET, SOCK_RAW, 768) = 3
    setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
    mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
    mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0

    The setsockopt() would allocate compound pages (16 pages in this test)
    for packet tx ring, then the mmap() would call packet_mmap() to map the
    pages into the user address space specified by the mmap() call.

    When calling mbind(), it would scan the vma to queue the pages for
    migration to the new node. It would split any huge page since 4.9
    doesn't support THP migration; however, the packet tx ring compound
    pages are not THP and not even movable. So the above bug is triggered.

    However, the later kernel is not hit by this issue due to commit
    d44d363f6578 ("mm: don't assume anonymous pages have SwapBacked flag"),
    which just removes the PageSwapBacked check for a different reason.

    But, there is a deeper issue. According to the semantic of mbind(), it
    should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
    MPOL_MF_STRICT was also specified, but the kernel was unable to move all
    existing pages in the range. The tx ring of the packet socket is
    definitely not movable, however, mbind() returns success for this case.

    Although most socket files are associated with non-movable pages, XDP
    may have movable pages from gup. So it is not enough to just check the
    underlying file type of the vma in vma_migratable().

    Change migrate_page_add() to check whether the page is movable; if it
    is unmovable, just return -EIO. But do not abort the pte walk
    immediately, since there may be pages temporarily off the LRU, and we
    should still migrate other pages if MPOL_MF_MOVE* is specified. Set the
    has_unmovable flag if some pages could not be moved, then return -EIO
    from mbind() eventually.

    With this change the above test would return -EIO as expected.
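
    A sketch of the described behaviour (simplified, not the exact kernel
    code; can_isolate_for_migration() is a hypothetical helper and
    has_unmovable is the flag mentioned above):

        /* in the page table walk, for each candidate page */
        if (!can_isolate_for_migration(page))
                has_unmovable = true;           /* remember it, keep walking */
        else
                list_add_tail(&page->lru, pagelist);

        /* after the walk, back in mbind() */
        if (has_unmovable && (flags & MPOL_MF_STRICT))
                err = -EIO;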

    [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
    Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When both MPOL_MF_MOVE* and MPOL_MF_STRICT were specified, mbind()
    should try its best to migrate misplaced pages; if some of the pages
    could not be migrated, it should then return -EIO.

    There are three different sub-cases:
    1. vma is not migratable
    2. vma is migratable, but there are unmovable pages
    3. vma is migratable, pages are movable, but migrate_pages() fails

    If #1 happens, kernel would just abort immediately, then return -EIO,
    after a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
    MPOL_MF_STRICT is specified").

    If #3 happens, kernel would set policy and migrate pages with
    best-effort, but won't rollback the migrated pages and reset the policy
    back.

    Before that commit, they behaved in the same way. It'd be better to
    keep their behavior consistent. But rolling back the migrated pages and
    resetting the policy does not sound feasible, so just make #1 behave
    the same as #3.

    Userspace will know that not everything was successfully migrated (via
    -EIO), and can take whatever steps it deems necessary - attempt
    rollback, determine which exact page(s) are violating the policy, etc.

    Make queue_pages_range() return 1 to indicate that there are unmovable
    pages or that the vma is not migratable.

    Case #2 is not handled correctly by the current kernel; the following
    patch fixes it.
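
    A rough sketch of how do_mbind() can then consume that return value
    (simplified; function names follow mm/mempolicy.c, but this is not the
    verbatim upstream code):

    ret = queue_pages_range(mm, start, end, nmask, flags, &pagelist);
    if (ret < 0) {
            err = ret;                      /* hard failure */
            goto up_out;
    }

    err = mbind_range(mm, start, end, new);
    if (!err) {
            int nr_failed = 0;

            if (!list_empty(&pagelist))
                    nr_failed = migrate_pages(&pagelist, new_page, NULL,
                                              start, MIGRATE_SYNC,
                                              MR_MEMPOLICY_MBIND);

            /* ret > 0: non-migratable vma or unmovable pages were seen */
            if ((ret > 0 || nr_failed) && (flags & MPOL_MF_STRICT))
                    err = -EIO;
    }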

    [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
    Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When migrating an anonymous private page to a ZONE_DEVICE private page,
    the source page->mapping and page->index fields are copied to the
    destination ZONE_DEVICE struct page and the page_mapcount() is
    increased. This is so rmap_walk() can be used to unmap and migrate the
    page back to system memory.

    However, try_to_unmap_one() computes the subpage pointer from a swap
    pte, which yields an invalid page pointer, and a kernel panic results,
    such as:

    BUG: unable to handle page fault for address: ffffea1fffffffc8

    Currently, only single pages can be migrated to device private memory,
    so no subpage computation is needed and the subpage pointer can simply
    be set to "page".
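
    In code terms, the device private branch of try_to_unmap_one() can
    simply use the whole page, roughly as follows (a sketch of the idea, not
    the exact hunk):

    if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
        is_zone_device_page(page)) {
            /*
             * Only whole pages are migrated to device private memory, so
             * there is no sub-page to derive from the swap pte; use the
             * page itself.
             */
            subpage = page;
            /* build and install the special migration entry as before */
    }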

    [rcampbell@nvidia.com: add comment]
    Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
    Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
    Signed-off-by: Ralph Campbell
    Cc: "Jérôme Glisse"
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Ira Weiny
    Cc: Jan Kara
    Cc: Lai Jiangshan
    Cc: Logan Gunthorpe
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • When a ZONE_DEVICE private page is freed, the page->mapping field can
    still hold a stale pointer from its previous use. If the page is then
    reused as an anonymous page, that stale value can prevent the page from
    being inserted into the CPU's anon rmap table. For example, when
    migrating a pte_none() page to device memory:

    migrate_vma(ops, vma, start, end, src, dst, private)
      migrate_vma_collect()
        src[] = MIGRATE_PFN_MIGRATE
      migrate_vma_prepare()
        /* no page to lock or isolate so OK */
      migrate_vma_unmap()
        /* no page to unmap so OK */
      ops->alloc_and_copy()
        /* driver allocates ZONE_DEVICE page for dst[] */
      migrate_vma_pages()
        migrate_vma_insert_page()
          page_add_new_anon_rmap()
            __page_set_anon_rmap()
              /* This check sees the page's stale mapping field */
              if (PageAnon(page))
                  return
      /* page->mapping is not updated */

    The result is that the migration appears to succeed but a subsequent CPU
    fault will be unable to migrate the page back to system memory or worse.

    Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
    pointer data doesn't affect future page use.
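
    A minimal sketch of that fix in the device page release path (the
    function name here is hypothetical; upstream the clearing happens where
    the last reference to a devmap-managed page is dropped, just before the
    driver's page_free callback runs):

    static void release_device_private_page(struct page *page)
    {
            /*
             * page->mapping may still point at the old anon_vma; clear it
             * so a later page_add_new_anon_rmap() sees a fresh page.
             */
            page->mapping = NULL;
            page->pgmap->ops->page_free(page);
    }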

    Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
    Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free")
    Signed-off-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: "Jérôme Glisse"
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

10 Aug, 2019

1 commit

  • Currently, attempts to shutdown and re-enable a device-dax instance
    trigger:

    Missing reference count teardown definition
    WARNING: CPU: 37 PID: 1608 at mm/memremap.c:211 devm_memremap_pages+0x234/0x850
    [..]
    RIP: 0010:devm_memremap_pages+0x234/0x850
    [..]
    Call Trace:
    dev_dax_probe+0x66/0x190 [device_dax]
    really_probe+0xef/0x390
    driver_probe_device+0xb4/0x100
    device_driver_attach+0x4f/0x60

    Given that the setup path initializes pgmap->ref, arrange for it to also
    be torn down, so that devm_memremap_pages() is ready to be called again
    and is not mistaken for the 3rd-party per-cpu-ref case.
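
    Conceptually, the release path needs something along these lines (a
    sketch assuming the internal reference case is detectable via
    pgmap->internal_ref; not the exact upstream change):

    /* in the devm_memremap_pages() teardown path */
    if (pgmap->ref == &pgmap->internal_ref) {
            /* wait for the last reference, then free the percpu counter */
            wait_for_completion(&pgmap->done);
            percpu_ref_exit(pgmap->ref);
            pgmap->ref = NULL;      /* let a later setup re-initialize it */
    }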

    Fixes: 24917f6b1041 ("memremap: provide an optional internal refcount in struct dev_pagemap")
    Reported-by: Fan Du
    Tested-by: Vishal Verma
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/156530042781.2068700.8733813683117819799.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

03 Aug, 2019

7 commits

  • memremap.c implements MM functionality for ZONE_DEVICE, so it really
    should be in the mm/ directory, not the kernel/ one.

    Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Anshuman Khandual
    Acked-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • A return statement at the end of a void function is unneeded; remove it.

    Link: http://lkml.kernel.org/r/20190723130814.21826-1-houweitaoo@gmail.com
    Signed-off-by: Weitao Hou
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weitao Hou
     
  • When CONFIG_MIGRATE_VMA_HELPER is enabled, migrate_vma() calls
    migrate_vma_collect(), which initializes a struct mm_walk but does not
    initialize mm_walk.pud_entry (found by code inspection). Use C structure
    initialization to make sure it is set to NULL.
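
    The shape of the fix, assuming the struct mm_walk layout of that kernel
    (designated initializers implicitly zero .pud_entry and every other
    omitted member):

    struct mm_walk mm_walk = {
            .pmd_entry = migrate_vma_collect_pmd,
            .pte_hole = migrate_vma_collect_hole,
            .vma = migrate->vma,
            .mm = migrate->vma->vm_mm,
            .private = migrate,
            /* .pud_entry and all other unset callbacks are NULL */
    };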

    Link: http://lkml.kernel.org/r/20190719233225.12243-1-rcampbell@nvidia.com
    Fixes: 8763cb45ab967 ("mm/migrate: new memory migration helper for use with device memory")
    Signed-off-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Andrew Morton
    Cc: "Jérôme Glisse"
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • "howaboutsynergy" reported via kernel bugzilla entry 204165 that
    compact_zone_order was consuming 100% CPU for prolonged periods of time
    during a stress test. Specifically, the following command, which should
    exit in 10 seconds, was taking an excessive time to finish while the CPU
    was pegged at 100%.

    stress -m 220 --vm-bytes 1000000000 --timeout 10

    Tracing indicated a pattern as follows

    stress-3923 [007] 519.106208: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106212: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106216: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106219: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106223: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106227: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106231: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106235: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106238: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106242: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0

    Note that compaction is entered in rapid succession while scanning and
    isolating nothing. The problem is that when a compacting task receives a
    fatal signal, it retries indefinitely while making no progress, instead
    of exiting.

    It's not easy to trigger this condition although enabling zswap helps on
    the basis that the timing is altered. A very small window has to be hit
    for the problem to occur (signal delivered while compacting and
    isolating a PFN for migration that is not aligned to SWAP_CLUSTER_MAX).

    This was reproduced locally (16G single-socket system, 8G swap, 30%
    zswap configured, vm-bytes 22000000000, using Colin King's stress-ng
    implementation from github, running in a loop until the problem hit).
    Tracing recorded the problem occurring almost 200K times in a short
    window. With this patch, the problem hit 4 times but the task exited
    normally instead of consuming CPU.

    This problem has existed for some time but it was made worse by commit
    cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as
    contention"). Before that commit, if the same condition was hit then
    locks would be quickly contended and compaction would exit that way.
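
    The gist of the fix is to bail out of the isolation loop when a fatal
    signal is pending, roughly like this (a sketch, not the exact upstream
    hunk):

    /* inside the scan loop of isolate_migratepages_block() */
    if (fatal_signal_pending(current)) {
            cc->contended = true;   /* treat it like lock contention */
            return 0;               /* nothing isolated; compaction aborts */
    }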

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204165
    Link: http://lkml.kernel.org/r/20190718085708.GE24383@techsingularity.net
    Fixes: cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as contention")
    Signed-off-by: Mel Gorman
    Reviewed-by: Vlastimil Babka
    Cc: [5.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • buffer_migrate_page_norefs() can race with bh users in the following
    way:

    CPU1                                      CPU2
    buffer_migrate_page_norefs()
      buffer_migrate_lock_buffers()
      checks bh refs
      spin_unlock(&mapping->private_lock)
                                              __find_get_block()
                                                spin_lock(&mapping->private_lock)
                                                grab bh ref
                                                spin_unlock(&mapping->private_lock)
      move page                               do bh work

    This can result in various issues such as lost updates to buffers (i.e.
    metadata corruption) or use-after-free of the old page.

    This patch closes the race by holding mapping->private_lock while the
    mapping is being moved to a new page. Ordinarily, a reference can be
    taken outside of the private_lock using the per-cpu BH LRU but the
    references are checked and the LRU invalidated if necessary. The
    private_lock is held once the references are known so the buffer lookup
    slow path will spin on the private_lock. Between the page lock and
    private_lock, it should be impossible for other references to be
    acquired and updates to happen during the migration.

    A user had reported data corruption issues on a distribution kernel with
    a page migration implementation similar to mainline's. The data
    corruption could not be reproduced with this patch applied. A small
    number of migration-intensive tests were run and no performance problems
    were noted.
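
    Schematically, the critical section in __buffer_migrate_page() grows to
    cover the actual move for the check_refs (buffer_migrate_page_norefs)
    case; the busy check below is a stand-in for the real refcount test:

    spin_lock(&mapping->private_lock);
    /* invalidate the per-cpu BH LRU and re-check buffer refcounts here */
    if (busy) {
            spin_unlock(&mapping->private_lock);
            return -EAGAIN;
    }
    /*
     * Move the page and re-attach the buffers while __find_get_block()
     * callers spin on private_lock, so no new references can sneak in.
     */
    /* ... migrate_page_move_mapping() and buffer re-attachment ... */
    spin_unlock(&mapping->private_lock);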

    [mgorman@techsingularity.net: Changelog, removed tracing]
    Link: http://lkml.kernel.org/r/20190718090238.GF24383@techsingularity.net
    Fixes: 89cb0888ca14 ("mm: migrate: provide buffer_migrate_page_norefs()")
    Signed-off-by: Jan Kara
    Signed-off-by: Mel Gorman
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Shakeel Butt reported a premature oom on a kernel booted with
    "cgroup_disable=memory", because mem_cgroup_is_root() returns false even
    though memcg is actually NULL. The drop_caches interface is also broken.

    This is because commit aeed1d325d42 ("mm/vmscan.c: generalize
    shrink_slab() calls in shrink_node()") removed the !memcg check before
    !mem_cgroup_is_root(). And, surprisingly, the root memcg is allocated
    even though the memory cgroup is disabled by the kernel boot parameter.

    Add mem_cgroup_disabled() check to make reclaimer work as expected.
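
    The check lands in the slab shrinking path, roughly like this (a sketch;
    the exact function layout may differ):

    /* mm/vmscan.c: deciding whether to use the memcg-aware shrinkers */
    if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
            return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

    /* cgroup_disable=memory: fall through to the global shrinker walk */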

    Link: http://lkml.kernel.org/r/1563385526-20805-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: aeed1d325d42 ("mm/vmscan.c: generalize shrink_slab() calls in shrink_node()")
    Signed-off-by: Yang Shi
    Reported-by: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Jan Hadrava
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Hugh Dickins
    Cc: Qian Cai
    Cc: Kirill A. Shutemov
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When running ltp's oom test with kmemleak enabled, the warning below was
    triggered, because the kernel detects that __GFP_NOFAIL is passed in
    without __GFP_DIRECT_RECLAIM:

    WARNING: CPU: 105 PID: 2138 at mm/page_alloc.c:4608 __alloc_pages_nodemask+0x1c31/0x1d50
    Modules linked in: loop dax_pmem dax_pmem_core ip_tables x_tables xfs virtio_net net_failover virtio_blk failover ata_generic virtio_pci virtio_ring virtio libata
    CPU: 105 PID: 2138 Comm: oom01 Not tainted 5.2.0-next-20190710+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:__alloc_pages_nodemask+0x1c31/0x1d50
    ...
    kmemleak_alloc+0x4e/0xb0
    kmem_cache_alloc+0x2a7/0x3e0
    mempool_alloc_slab+0x2d/0x40
    mempool_alloc+0x118/0x2b0
    bio_alloc_bioset+0x19d/0x350
    get_swap_bio+0x80/0x230
    __swap_writepage+0x5ff/0xb20

    mempool_alloc_slab() clears __GFP_DIRECT_RECLAIM, but kmemleak has
    __GFP_NOFAIL set all the time due to commit d9570ee3bd1d4f2 ("kmemleak:
    allow to coexist with fault injection"). It does not make any sense to
    specify __GFP_NOFAIL without __GFP_DIRECT_RECLAIM.

    According to the discussion on the mailing list, the commit should be
    reverted as a short-term solution; Catalin Marinas will follow up with a
    better long-term solution.

    The failure rate of kmemleak metadata allocation may increase in some
    circumstances, but this is an expected side effect.
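
    For reference, the warning comes from a sanity check in the page
    allocator's __GFP_NOFAIL slow path that looks roughly like this
    (paraphrased, not quoted verbatim):

    if (gfp_mask & __GFP_NOFAIL) {
            /*
             * __GFP_NOFAIL callers must be able to enter direct reclaim;
             * warn about and fail atomic-style NOFAIL requests.
             */
            if (WARN_ON_ONCE(!(gfp_mask & __GFP_DIRECT_RECLAIM)))
                    goto fail;
    }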

    Link: http://lkml.kernel.org/r/1563299431-111710-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: d9570ee3bd1d4f2 ("kmemleak: allow to coexist with fault injection")
    Signed-off-by: Yang Shi
    Suggested-by: Catalin Marinas
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Cc: David Rientjes
    Cc: Matthew Wilcox
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

01 Aug, 2019

1 commit

  • To properly clear the slab on free with slab_want_init_on_free, we walk
    the list of free objects using get_freepointer/set_freepointer.

    The value we get from get_freepointer may not be valid. This isn't
    normally an issue since an actual value will be written later, but it
    means there's a chance of triggering a BUG if we use this value with
    set_freepointer:

    kernel BUG at mm/slub.c:306!
    invalid opcode: 0000 [#1] PREEMPT PTI
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-05754-g6471384a #4
    RIP: 0010:kfree+0x58a/0x5c0
    Code: 48 83 05 78 37 51 02 01 0f 0b 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 d6 37 51 02 01 0b 48 83 05 d4 37 51 02 01 48 83 05 d4 37 51 02 01 48 83 05 d4
    RSP: 0000:ffffffff82603d90 EFLAGS: 00010002
    RAX: ffff8c3976c04320 RBX: ffff8c3976c04300 RCX: 0000000000000000
    RDX: ffff8c3976c04300 RSI: 0000000000000000 RDI: ffff8c3976c04320
    RBP: ffffffff82603db8 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff8c3976c04320 R11: ffffffff8289e1e0 R12: ffffd52cc8db0100
    R13: ffff8c3976c01a00 R14: ffffffff810f10d4 R15: ffff8c3976c04300
    FS: 0000000000000000(0000) GS:ffffffff8266b000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8c397ffff000 CR3: 0000000125020000 CR4: 00000000000406b0
    Call Trace:
    apply_wqattrs_prepare+0x154/0x280
    apply_workqueue_attrs_locked+0x4e/0xe0
    apply_workqueue_attrs+0x36/0x60
    alloc_workqueue+0x25a/0x6d0
    workqueue_init_early+0x246/0x348
    start_kernel+0x3c7/0x7ec
    x86_64_start_reservations+0x40/0x49
    x86_64_start_kernel+0xda/0xe4
    secondary_startup_64+0xb6/0xc0
    Modules linked in:
    ---[ end trace f67eb9af4d8d492b ]---

    Fix this by ensuring the value we set with set_freepointer is either NULL
    or another value in the chain.
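
    Conceptually, the zeroing walk should link each object to the previously
    processed one, so that the stored free pointer is always NULL or a chain
    member. A simplified sketch, assuming a NULL-terminated list for clarity
    (the real slab_free_freelist_hook() loop differs in detail):

    void *object = *head;
    void *prev = NULL;

    while (object) {
            void *next = get_freepointer(s, object);

            memset(object, 0, s->object_size);
            /* write a known-good link: NULL or an already-visited object */
            set_freepointer(s, object, prev);
            prev = object;
            object = next;
    }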

    Reported-by: kernel test robot
    Signed-off-by: Laura Abbott
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Reviewed-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

31 Jul, 2019

1 commit

  • Pull HMM fixes from Jason Gunthorpe:
    "Fix the locking around nouveau's use of the hmm_range_* APIs. It works
    correctly in the success case, but many of the edge cases have missing
    unlocks or double unlocks.

    The diffstat is a bit big as Christoph did a comprehensive job to move
    the obsolete API from the core header and into the driver before
    fixing its flow, but the risk of regression from this code motion is
    low"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    nouveau: unlock mmap_sem on all errors from nouveau_range_fault
    nouveau: remove the block parameter to nouveau_range_fault
    mm/hmm: move hmm_vma_range_done and hmm_vma_fault to nouveau
    mm/hmm: always return EBUSY for invalid ranges in hmm_range_{fault,snapshot}

    Linus Torvalds
     

30 Jul, 2019

1 commit

  • Pull virtio/vhost fixes from Michael Tsirkin:

    - Fixes in the iommu and balloon devices.

    - Disable the meta-data optimization for now - I hope we can get it
    fixed shortly, but there's no point in making users suffer crashes
    while we are working on that.

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    vhost: disable metadata prefetch optimization
    iommu/virtio: Update to most recent specification
    balloon: fix up comments
    mm/balloon_compaction: avoid duplicate page removal

    Linus Torvalds