16 Nov, 2019

1 commit

  • Recently, I hit the following issue when running upstream.

    kernel BUG at mm/vmscan.c:1521!
    invalid opcode: 0000 [#1] SMP KASAN PTI
    CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
    RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
    Call Trace:
    reclaim_pages+0x499/0x800 mm/vmscan.c:2188
    madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
    walk_pmd_range mm/pagewalk.c:53 [inline]
    walk_pud_range mm/pagewalk.c:112 [inline]
    walk_p4d_range mm/pagewalk.c:139 [inline]
    walk_pgd_range mm/pagewalk.c:166 [inline]
    __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
    walk_page_range+0x179/0x310 mm/pagewalk.c:349
    madvise_pageout_page_range mm/madvise.c:506 [inline]
    madvise_pageout+0x1f0/0x330 mm/madvise.c:542
    madvise_vma mm/madvise.c:931 [inline]
    __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
    do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    madvise_pageout() accesses the specified range of the vma and isolates
    them, then runs shrink_page_list() to reclaim its memory. But it also
    isolates the unevictable pages to reclaim. Hence, we can catch the
    cases in shrink_page_list().

    The root cause is that we scan the page tables instead of specific LRU
    list. and so we need to filter out the unevictable lru pages from our
    end.

    Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
    Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
    Signed-off-by: zhong jiang
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

26 Sep, 2019

4 commits

  • There are many common parts between MADV_COLD and MADV_PAGEOUT.
    This patch factor them out to save code duplication.

    Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Chris Zankel
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: James E.J. Bottomley
    Cc: Joel Fernandes (Google)
    Cc: kbuild test robot
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When a process expects no accesses to a certain memory range for a long
    time, it could hint kernel that the pages can be reclaimed instantly but
    data should be preserved for future use. This could reduce workingset
    eviction so it ends up increasing performance.

    This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
    MADV_PAGEOUT can be used by a process to mark a memory range as not
    expected to be used for a long time so that kernel reclaims *any LRU*
    pages instantly. The hint can help kernel in deciding which pages to
    evict proactively.

    A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
    intentionally because it's automatically bounded by PMD size. If PMD
    size(e.g., 256) makes some trouble, we could fix it later by limit it to
    SWAP_CLUSTER_MAX[1].

    - man-page material

    MADV_PAGEOUT (since Linux x.x)

    Do not expect access in the near future so pages in the specified
    regions could be reclaimed instantly regardless of memory pressure.
    Thus, access in the range after successful operation could cause
    major page fault but never lose the up-to-date contents unlike
    MADV_DONTNEED. Pages belonging to a shared mapping are only processed
    if a write access is allowed for the calling process.

    MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
    VM_PFNMAP pages.

    [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/

    [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
    Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

    - Background

    The Android terminology used for forking a new process and starting an app
    from scratch is a cold start, while resuming an existing app is a hot
    start. While we continually try to improve the performance of cold
    starts, hot starts will always be significantly less power hungry as well
    as faster so we are trying to make hot start more likely than cold start.

    To increase hot start, Android userspace manages the order that apps
    should be killed in a process called ActivityManagerService.
    ActivityManagerService tracks every Android app or service that the user
    could be interacting with at any time and translates that into a ranked
    list for lmkd(low memory killer daemon). They are likely to be killed by
    lmkd if the system has to reclaim memory. In that sense they are similar
    to entries in any other cache. Those apps are kept alive for
    opportunistic performance improvements but those performance improvements
    will vary based on the memory requirements of individual workloads.

    - Problem

    Naturally, cached apps were dominant consumers of memory on the system.
    However, they were not significant consumers of swap even though they are
    good candidate for swap. Under investigation, swapping out only begins
    once the low zone watermark is hit and kswapd wakes up, but the overall
    allocation rate in the system might trip lmkd thresholds and cause a
    cached process to be killed(we measured performance swapping out vs.
    zapping the memory by killing a process. Unsurprisingly, zapping is 10x
    times faster even though we use zram which is much faster than real
    storage) so kill from lmkd will often satisfy the high zone watermark,
    resulting in very few pages actually being moved to swap.

    - Approach

    The approach we chose was to use a new interface to allow userspace to
    proactively reclaim entire processes by leveraging platform information.
    This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
    that are known to be cold from userspace and to avoid races with lmkd by
    reclaiming apps as soon as they entered the cached state. Additionally,
    it could provide many chances for platform to use much information to
    optimize memory efficiency.

    To achieve the goal, the patchset introduce two new options for madvise.
    One is MADV_COLD which will deactivate activated pages and the other is
    MADV_PAGEOUT which will reclaim private pages instantly. These new
    options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
    ways to gain some free memory space. MADV_PAGEOUT is similar to
    MADV_DONTNEED in a way that it hints the kernel that memory region is not
    currently needed and should be reclaimed immediately; MADV_COLD is similar
    to MADV_FREE in a way that it hints the kernel that memory region is not
    currently needed and should be reclaimed when memory pressure rises.

    This patch (of 5):

    When a process expects no accesses to a certain memory range, it could
    give a hint to kernel that the pages can be reclaimed when memory pressure
    happens but data should be preserved for future use. This could reduce
    workingset eviction so it ends up increasing performance.

    This patch introduces the new MADV_COLD hint to madvise(2) syscall.
    MADV_COLD can be used by a process to mark a memory range as not expected
    to be used in the near future. The hint can help kernel in deciding which
    pages to evict early during memory pressure.

    It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves

    active file page -> inactive file LRU
    active anon page -> inacdtive anon LRU

    Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
    LRU's head because MADV_COLD is a little bit different symantic.
    MADV_FREE means it's okay to discard when the memory pressure because the
    content of the page is *garbage* so freeing such pages is almost zero
    overhead since we don't need to swap out and access afterward causes just
    minor fault. Thus, it would make sense to put those freeable pages in
    inactive file LRU to compete other used-once pages. It makes sense for
    implmentaion point of view, too because it's not swapbacked memory any
    longer until it would be re-dirtied. Even, it could give a bonus to make
    them be reclaimed on swapless system. However, MADV_COLD doesn't mean
    garbage so reclaiming them requires swap-out/in in the end so it's bigger
    cost. Since we have designed VM LRU aging based on cost-model, anonymous
    cold pages would be better to position inactive anon's LRU list, not file
    LRU. Furthermore, it would help to avoid unnecessary scanning if system
    doesn't have a swap device. Let's start simpler way without adding
    complexity at this moment. However, keep in mind, too that it's a caveat
    that workloads with a lot of pages cache are likely to ignore MADV_COLD on
    anonymous memory because we rarely age anonymous LRU lists.

    * man-page material

    MADV_COLD (since Linux x.x)

    Pages in the specified regions will be treated as less-recently-accessed
    compared to pages in the system with similar access frequencies. In
    contrast to MADV_FREE, the contents of the region are preserved regardless
    of subsequent writes to pages.

    MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
    pages.

    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Johannes Weiner
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch is a part of a series that extends kernel ABI to allow to pass
    tagged user pointers (with the top byte set to something else other than
    0x00) as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

25 Sep, 2019

1 commit

  • madvise_behavior() converts -ENOMEM to -EAGAIN in several places using
    identical code.

    Move that code to a common error handling path.

    No functional changes.

    Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pankaj Gupta
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

22 Sep, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is more cleanup and consolidation of the hmm APIs and the very
    strongly related mmu_notifier interfaces. Many places across the tree
    using these interfaces are touched in the process. Beyond that a
    cleanup to the page walker API and a few memremap related changes
    round out the series:

    - General improvement of hmm_range_fault() and related APIs, more
    documentation, bug fixes from testing, API simplification &
    consolidation, and unused API removal

    - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
    and make them internal kconfig selects

    - Hoist a lot of code related to mmu notifier attachment out of
    drivers by using a refcount get/put attachment idiom and remove the
    convoluted mmu_notifier_unregister_no_release() and related APIs.

    - General API improvement for the migrate_vma API and revision of its
    only user in nouveau

    - Annotate mmu_notifiers with lockdep and sleeping region debugging

    Two series unrelated to HMM or mmu_notifiers came along due to
    dependencies:

    - Allow pagemap's memremap_pages family of APIs to work without
    providing a struct device

    - Make walk_page_range() and related use a constant structure for
    function pointers"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...

    Linus Torvalds
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixed data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a new header for the two handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

31 Aug, 2019

1 commit

  • Currently handling of MADV_WILLNEED hint calls directly into readahead
    code. Handle it by calling vfs_fadvise() instead so that filesystem can
    use its ->fadvise() callback to acquire necessary locks or otherwise
    prepare for the request.

    Suggested-by: Amir Goldstein
    Reviewed-by: Boaz Harrosh
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Jan Kara
     

03 Jul, 2019

1 commit

  • The code hasn't been used since it was added to the tree, and doesn't
    appear to actually be usable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Acked-by: Michal Hocko
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

15 May, 2019

2 commits

  • This updates each existing invalidation to use the correct mmu notifier
    event that represent what is happening to the CPU page table. See the
    patch which introduced the events to see the rational behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table update can happens for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of mmu notifier API track changes to the CPU page table and take
    specific action for them. While current API only provide range of virtual
    address affected by the change, not why the changes is happening.

    This patchset do the initial mechanical convertion of all the places that
    calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
    event as well as the vma if it is know (most invalidation happens against
    a given vma). Passing down the vma allows the users of mmu notifier to
    inspect the new vma page protection.

    The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
    should assume that every for the range is going away when that event
    happens. A latter patch do convert mm call path to use a more appropriate
    events for each call.

    This is done as 2 patches so that no call site is forgotten especialy
    as it uses this following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

03 Apr, 2019

1 commit

  • Move the mmu_gather::page_size things into the generic code instead of
    PowerPC specific bits.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites everytime we want to add a
    parameter use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end cakks. No functional changes with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

06 Oct, 2018

1 commit

  • Reproducer, assuming 2M of hugetlbfs available:

    Hugetlbfs mounted, size=2M and option user=testuser

    # mount | grep ^hugetlbfs
    hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
    # sysctl vm.nr_hugepages=1
    vm.nr_hugepages = 1
    # grep Huge /proc/meminfo
    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 1
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 2048 kB

    Code:

    #include
    #include
    #define SIZE 2*1024*1024
    int main()
    {
    void *ptr;
    ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
    madvise(ptr, SIZE, MADV_DONTDUMP);
    madvise(ptr, SIZE, MADV_DODUMP);
    }

    Compile and strace:

    mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
    madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
    madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

    hugetlbfs pages have VM_DONTEXPAND in the VmFlags driver pages based on
    author testing with analysis from Florian Weimer[1].

    The inclusion of VM_DONTEXPAND into the VM_SPECIAL defination was a
    consequence of the large useage of VM_DONTEXPAND in device drivers.

    A consequence of [2] is that VM_DONTEXPAND marked pages are unable to be
    marked DODUMP.

    A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
    memory for a while and later request that madvise(MADV_DODUMP) on the same
    memory. We correct this omission by allowing madvice(MADV_DODUMP) on
    hugetlbfs pages.

    [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
    [2] commit 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")

    Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
    Link: https://lists.launchpad.net/maria-discuss/msg05245.html
    Fixes: 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")
    Reported-by: Kenneth Penza
    Signed-off-by: Daniel Black
    Reviewed-by: Mike Kravetz
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Daniel Black
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

24 Jul, 2018

1 commit

  • The madvise_inject_error() routine uses get_user_pages() to lookup the
    pfn and other information for injected error, but it does not release
    that pin. The assumption is that failed pages should be taken out of
    circulation.

    However, for dax mappings it is not possible to take pages out of
    circulation since they are 1:1 physically mapped as filesystem blocks,
    or device-dax capacity. They also typically represent persistent memory
    which has an error clearing capability.

    In preparation for adding a special handler for dax mappings, shift the
    responsibility of taking the page reference to memory_failure(). I.e.
    drop the page reference and do not specify MF_COUNT_INCREASED to
    memory_failure().

    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Signed-off-by: Dan Williams
    Acked-by: Naoya Horiguchi
    Signed-off-by: Dave Jiang

    Dan Williams
     

24 Jan, 2018

1 commit


30 Nov, 2017

1 commit

  • MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
    Unfortunately madvise_willneed() doesn't communicate this information
    properly to the generic madvise syscall implementation. The calling
    convention is quite subtle there. madvise_vma() is supposed to either
    return an error or update &prev otherwise the main loop will never
    advance to the next vma and it will keep looping for ever without a way
    to get out of the kernel.

    It seems this has been broken since introduction. Nobody has noticed
    because nobody seems to be using MADVISE_WILLNEED on these DAX mappings.

    [mhocko@suse.com: rewrite changelog]
    Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
    Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place")
    Signed-off-by: chenjie
    Signed-off-by: guoxuenan
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: zhangyi (F)
    Cc: Miao Xie
    Cc: Mike Rapoport
    Cc: Shaohua Li
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Rik van Riel
    Cc: Carsten Otte
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    chenjie
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Oct, 2017

1 commit

  • mm/madvise.c has a brief description about all MADV_ flags. Add a
    description for the newly added MADV_WIPEONFORK and MADV_KEEPONFORK.

    Although man page has the similar information, but it'd better to keep
    the consistent with other flags.

    Link: http://lkml.kernel.org/r/1506117328-88228-1-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

04 Oct, 2017

1 commit

  • This fixes a bug in madvise() where if you'd try to soft offline a
    hugepage via madvise(), while walking the address range you'd end up,
    using the wrong page offset due to attempting to get the compound order
    of a former but presently not compound page, due to dissolving the huge
    page (since commit c3114a84f7f9: "mm: hugetlb: soft-offline: dissolve
    source hugepage after successful migration").

    As a result I ended up with all my free pages except one being offlined.

    Link: http://lkml.kernel.org/r/20170912204306.GA12053@gmail.com
    Fixes: c3114a84f7f9 ("mm: hugetlb: soft-offline: dissolve source hugepage after successful migration")
    Signed-off-by: Alexandru Moise
    Cc: Anshuman Khandual
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Hillf Danton
    Cc: Shaohua Li
    Cc: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     

09 Sep, 2017

1 commit

  • Platform with advance system bus (like CAPI or CCIX) allow device memory
    to be accessible from CPU in a cache coherent fashion. Add a new type of
    ZONE_DEVICE to represent such memory. The use case are the same as for
    the un-addressable device memory but without all the corners cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

1 commit

  • Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
    in the child process after fork. This differs from MADV_DONTFORK in one
    important way.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    MADV_WIPEONFORK only works on private, anonymous VMAs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-inialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

    [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
    Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Reported-by: Florian Weimer
    Reported-by: Colm MacCártaigh
    Reviewed-by: Mike Kravetz
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Helge Deller
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Will Drewry
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

01 Sep, 2017

1 commit

  • Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
    and bisected it to the commit 479f854a207c ("mm, page_alloc: defer
    debugging checks of pages allocated from the PCP").

    The problem is that a page that was poisoned with madvise() is reused.
    The commit removed a check that would trigger if DEBUG_VM was enabled
    but re-enabling the check only fixes the problem as a side-effect by
    printing a bad_page warning and recovering.

    The root of the problem is that an madvise() can leave a poisoned page
    on the per-cpu list. This patch drains all per-cpu lists after pages
    are poisoned so that they will not be reused. Wendy reports that the
    test case in question passes with this patch applied. While this could
    be done in a targeted fashion, it is over-complicated for such a rare
    operation.

    Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
    Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
    Signed-off-by: Mel Gorman
    Reported-by: Wang, Wendy
    Tested-by: Wang, Wendy
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Hansen, Dave"
    Cc: "Luck, Tony"
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

26 Aug, 2017

1 commit

  • If madvise(..., MADV_FREE) split a transparent hugepage, it called
    put_page() before unlock_page().

    This was wrong because put_page() can free the page, e.g. if a
    concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
    mapping. put_page() then rightfully complained about freeing a locked
    page.

    Fix this by moving the unlock_page() before put_page().

    This bug was found by syzkaller, which encountered the following splat:

    BUG: Bad page state in process syzkaller412798 pfn:1bd800
    page:ffffea0006f60000 count:0 mapcount:0 mapping: (null) index:0x20a00
    flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
    raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
    raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in:
    CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    bad_page+0x230/0x2b0 mm/page_alloc.c:565
    free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
    free_pages_check mm/page_alloc.c:952 [inline]
    free_pages_prepare mm/page_alloc.c:1043 [inline]
    free_pcp_prepare mm/page_alloc.c:1068 [inline]
    free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
    __put_single_page mm/swap.c:79 [inline]
    __put_page+0xfb/0x160 mm/swap.c:113
    put_page include/linux/mm.h:814 [inline]
    madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:108 [inline]
    walk_p4d_range mm/pagewalk.c:134 [inline]
    walk_pgd_range mm/pagewalk.c:160 [inline]
    __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
    walk_page_range+0x200/0x470 mm/pagewalk.c:326
    madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
    madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
    madvise_dontneed_free mm/madvise.c:555 [inline]
    madvise_vma mm/madvise.c:664 [inline]
    SYSC_madvise mm/madvise.c:832 [inline]
    SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Here is a C reproducer:

    #define _GNU_SOURCE
    #include
    #include
    #include

    #define MADV_FREE 8
    #define PAGE_SIZE 4096

    static void *mapping;
    static const size_t mapping_size = 0x1000000;

    static void *madvise_thrproc(void *arg)
    {
    madvise(mapping, mapping_size, (long)arg);
    }

    int main(void)
    {
    pthread_t t[2];

    for (;;) {
    mapping = mmap(NULL, mapping_size, PROT_WRITE,
    MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

    munmap(mapping + mapping_size / 2, PAGE_SIZE);

    pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
    pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    munmap(mapping, mapping_size);
    }
    }

    Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
    CONFIG_DEBUG_VM=y are needed.

    Google Bug Id: 64696096

    Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
    Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)")
    Signed-off-by: Eric Biggers
    Acked-by: David Rientjes
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: [v4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

03 Aug, 2017

1 commit

  • Nadav Amit identified a theoritical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0 CPU1
    ---- ----
    user accesses memory using RW PTE
    [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
    mprotect(addr, PROT_READ)
    ==> change_pte_range()
    ==> [ PTE non-present - no flush ]

    user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoritically possible so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup as a race with
    reclaim looks just at PTEs. huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Jul, 2017

2 commits

  • MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd
    monitor. The monitor has to stop handling #PF events in the range being
    freed. We are reusing userfaultfd_remove callback along with the logic
    required to re-get and re-validate the VMA which may change or disappear
    because userfaultfd_remove releases mmap_sem.

    Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • For fast flash disk, async IO could introduce overhead because of
    context switch. block-mq now supports IO poll, which improves
    performance and latency a lot. swapin is a good place to use this
    technique, because the task is waiting for the swapin page to continue
    execution.

    In my virtual machine, directly read 4k data from a NVMe with iopoll is
    about 60% better than that without poll. With iopoll support in swapin
    patch, my microbenchmark (a task does random memory write) is about
    10%~25% faster. CPU utilization increases a lot though, 2x and even 3x
    CPU utilization. This will depend on disk speed.

    While iopoll in swapin isn't intended for all usage cases, it's a win
    for latency sensistive workloads with high speed swap disk. block layer
    has knob to control poll in runtime. If poll isn't enabled in block
    layer, there should be no noticeable change in swapin.

    I got a chance to run the same test in a NVMe with DRAM as the media.
    In simple fio IO test, blkpoll boosts 50% performance in single thread
    test and ~20% in 8 threads test. So this is the base line. In above
    swap test, blkpoll boosts ~27% performance in single thread test.
    blkpoll uses 2x CPU time though.

    If we enable hybid polling, the performance gain has very slight drop
    but CPU time is only 50% worse than that without blkpoll. Also we can
    adjust parameter of hybid poll, with it, the CPU time penality is
    reduced further. In 8 threads test, blkpoll doesn't help though. The
    performance is similar to that without blkpoll, but cpu utilization is
    similar too. There is lock contention in swap path. The cpu time
    spending on blkpoll isn't high. So overall, blkpoll swapin isn't worse
    than that without it.

    The swapin readahead might read several pages in in the same time and
    form a big IO request. Since the IO will take longer time, it doesn't
    make sense to do poll, so the patch only does iopoll for single page
    swapin.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
    Signed-off-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 May, 2017

5 commits

  • madvise_behavior_valid() should be called before acting upon the
    behavior parameter. Hence move up the function. This also includes
    MADV_SOFT_OFFLINE and MADV_HWPOISON options as valid behavior parameter
    for the system call madvise().

    Link: http://lkml.kernel.org/r/20170418052844.24891-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: David Rientjes
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • This cleans up handling MADV_SOFT_OFFLINE and MADV_HWPOISON called
    through madvise() system call.

    * madvise_memory_failure() was misleading to accommodate handling of
    both memory_failure() as well as soft_offline_page() functions.
    Basically it handles memory error injection from user space which
    can go either way as memory failure or soft offline. Renamed as
    madvise_inject_error() instead.

    * Renamed struct page pointer 'p' to 'page'.

    * pr_info() was essentially printing PFN value but it said 'page'
    which was misleading. Made the process virtual address explicit.

    Before the patch:

    Soft offlining page 0x15e3e at 0x3fff8c230000
    Soft offlining page 0x1f3 at 0x3fffa0da0000
    Soft offlining page 0x744 at 0x3fff7d200000
    Soft offlining page 0x1634d at 0x3fff95e20000
    Soft offlining page 0x16349 at 0x3fff95e30000
    Soft offlining page 0x1d6 at 0x3fff9e8b0000
    Soft offlining page 0x5f3 at 0x3fff91bd0000

    Injecting memory failure for page 0x15c8b at 0x3fff83280000
    Injecting memory failure for page 0x16190 at 0x3fff83290000
    Injecting memory failure for page 0x740 at 0x3fff9a2e0000
    Injecting memory failure for page 0x741 at 0x3fff9a2f0000

    After the patch:

    Soft offlining pfn 0x1484e at process virtual address 0x3fff883c0000
    Soft offlining pfn 0x1484f at process virtual address 0x3fff883d0000
    Soft offlining pfn 0x14850 at process virtual address 0x3fff883e0000
    Soft offlining pfn 0x14851 at process virtual address 0x3fff883f0000
    Soft offlining pfn 0x14852 at process virtual address 0x3fff88400000
    Soft offlining pfn 0x14853 at process virtual address 0x3fff88410000
    Soft offlining pfn 0x14854 at process virtual address 0x3fff88420000
    Soft offlining pfn 0x1521c at process virtual address 0x3fff6bc70000

    Injecting memory failure for pfn 0x10fcf at process virtual address 0x3fff86310000
    Injecting memory failure for pfn 0x10fd0 at process virtual address 0x3fff86320000
    Injecting memory failure for pfn 0x10fd1 at process virtual address 0x3fff86330000
    Injecting memory failure for pfn 0x10fd2 at process virtual address 0x3fff86340000
    Injecting memory failure for pfn 0x10fd3 at process virtual address 0x3fff86350000
    Injecting memory failure for pfn 0x10fd4 at process virtual address 0x3fff86360000
    Injecting memory failure for pfn 0x10fd5 at process virtual address 0x3fff86370000

    Link: http://lkml.kernel.org/r/20170410084701.11248-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Now MADV_FREE pages can be easily reclaimed even for swapless system.
    We can safely enable MADV_FREE for all systems.

    Link: http://lkml.kernel.org/r/155648585589300bfae1d45078e7aebb3d988b87.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When memory pressure is high, we free MADV_FREE pages. If the pages are
    not dirty in pte, the pages could be freed immediately. Otherwise we
    can't reclaim them. We put the pages back to anonumous LRU list (by
    setting SwapBacked flag) and the pages will be reclaimed in normal
    swapout way.

    We use normal page reclaim policy. Since MADV_FREE pages are put into
    inactive file list, such pages and inactive file pages are reclaimed
    according to their age. This is expected, because we don't want to
    reclaim too many MADV_FREE pages before used once pages.

    Based on Minchan's original patch

    [minchan@kernel.org: clean up lazyfree page handling]
    Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
    Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Signed-off-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • madv()'s MADV_FREE indicate pages are 'lazyfree'. They are still
    anonymous pages, but they can be freed without pageout. To distinguish
    these from normal anonymous pages, we clear their SwapBacked flag.

    MADV_FREE pages could be freed without pageout, so they pretty much like
    used once file pages. For such pages, we'd like to reclaim them once
    there is memory pressure. Also it might be unfair reclaiming MADV_FREE
    pages always before used once file pages and we definitively want to
    reclaim the pages before other anonymous and file pages.

    To speed up MADV_FREE pages reclaim, we put the pages into
    LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny
    nowadays and should be full of used once file pages. Reclaiming
    MADV_FREE pages will not have much interfere of anonymous and active
    file pages. And the inactive file pages and MADV_FREE pages will be
    reclaimed according to their age, so we don't reclaim too many MADV_FREE
    pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
    means we can reclaim the pages without swap support. This idea is
    suggested by Johannes.

    This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
    avoid bisect failure, next patch will do it.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

10 Mar, 2017

1 commit

  • userfaultfd_remove() has to be execute before zapping the pagetables or
    UFFDIO_COPY could keep filling pages after zap_page_range returned,
    which would result in non zero data after a MADV_DONTNEED.

    However userfaultfd_remove() may have to release the mmap_sem. This was
    handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
    potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

    The fix consists in revalidating the vma in case userfaultfd_remove()
    had to release the mmap_sem.

    This also optimizes away an unnecessary down_read/up_read in the
    MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.

    It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
    userfaultfd_remove() will be defined as "true" at build time.

    Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Feb, 2017

3 commits

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more cfiles now need to #include it.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If madvise(2) advice will result in the underlying vma being split and
    the number of areas mapped by the process will exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED which is often used by malloc implementations to free
    memory back to the system: we really do want to free memory back when
    madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
    or mempolicies) cannot be allocated.

    Encountering /proc/sys/vm/max_map_count is not a temporary failure,
    however, so return ENOMEM to indicate this is a more serious issue. A
    followup patch to the man page will specify this behavior.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When a page is removed from a shared mapping, the uffd reader should be
    notified, so that it won't attempt to handle #PF events for the removed
    pages.

    We can reuse the UFFD_EVENT_REMOVE because from the uffd monitor point
    of view, the semantices of madvise(MADV_DONTNEED) and
    madvise(MADV_REMOVE) is exactly the same.

    Link: http://lkml.kernel.org/r/1484814154-1557-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hillf Danton
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport