26 Sep, 2019

1 commit

  • Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

    - Background

    In Android terminology, forking a new process and starting an app from
    scratch is a "cold start", while resuming an existing app is a "hot start".
    While we continually try to improve the performance of cold starts, hot
    starts will always be significantly faster as well as less power hungry,
    so we are trying to make hot starts more likely than cold starts.

    To increase the likelihood of hot starts, Android userspace manages the
    order in which apps should be killed in a process called
    ActivityManagerService. ActivityManagerService tracks every Android app or
    service that the user could be interacting with at any time and translates
    that into a ranked list for lmkd (the low memory killer daemon). These apps
    are likely to be killed by lmkd if the system has to reclaim memory; in
    that sense they are similar to entries in any other cache. Those apps are
    kept alive for opportunistic performance improvements, but those
    improvements will vary based on the memory requirements of individual
    workloads.

    - Problem

    Naturally, cached apps were the dominant consumers of memory on the system.
    However, they were not significant consumers of swap, even though they are
    good candidates for swap. On investigation we found that swapping out only
    begins once the low zone watermark is hit and kswapd wakes up, but the
    overall allocation rate in the system might trip lmkd thresholds and cause
    a cached process to be killed (we measured the performance of swapping out
    vs. zapping the memory by killing a process; unsurprisingly, zapping is
    about 10x faster even though we use zram, which is much faster than real
    storage), so a kill from lmkd will often satisfy the high zone watermark,
    resulting in very few pages actually being moved to swap.

    - Approach

    The approach we chose was to use a new interface to allow userspace to
    proactively reclaim entire processes by leveraging platform information.
    This allowed us to bypass the inaccuracy of the kernel's LRUs for pages
    that are known to be cold from userspace and to avoid races with lmkd by
    reclaiming apps as soon as they entered the cached state. Additionally, it
    gives the platform more opportunities to apply its own information to
    optimize memory efficiency.

    To achieve this, the patchset introduces two new options for madvise(). One
    is MADV_COLD, which deactivates active pages, and the other is
    MADV_PAGEOUT, which reclaims private pages immediately. These new options
    complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
    gain some free memory. MADV_PAGEOUT is similar to MADV_DONTNEED in that it
    hints to the kernel that the memory region is not currently needed and
    should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in that
    it hints to the kernel that the memory region is not currently needed and
    should be reclaimed when memory pressure rises.
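
    A minimal userspace sketch of how the two hints are meant to be used. The
    MADV_COLD/MADV_PAGEOUT values are assumptions taken from the series' uapi
    additions, hence the fallback defines:

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        #ifndef MADV_COLD
        #define MADV_COLD    20   /* assumed value from the series' uapi headers */
        #endif
        #ifndef MADV_PAGEOUT
        #define MADV_PAGEOUT 21   /* assumed value from the series' uapi headers */
        #endif

        int main(void)
        {
                size_t len = 64UL << 20;   /* 64 MiB of anonymous memory */
                char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (buf == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                memset(buf, 0xaa, len);    /* touch the pages */

                /* Not needed soon: age the range so reclaim picks it first. */
                if (madvise(buf, len, MADV_COLD))
                        perror("MADV_COLD (kernel too old?)");

                /* Not needed now: reclaim immediately, contents preserved. */
                if (madvise(buf, len, MADV_PAGEOUT))
                        perror("MADV_PAGEOUT (kernel too old?)");

                printf("still intact: %d\n", buf[0] == (char)0xaa);
                munmap(buf, len);
                return 0;
        }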

    This patch (of 5):

    When a process expects no accesses to a certain memory range, it can give
    the kernel a hint that the pages can be reclaimed when memory pressure
    happens, but that their data should be preserved for future use. This can
    reduce working-set eviction and so end up improving performance.

    This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
    MADV_COLD can be used by a process to mark a memory range as not expected
    to be used in the near future. The hint can help the kernel decide which
    pages to evict early under memory pressure.

    It works on every LRU page, like MADV_[DONTNEED|FREE]. In other words, it
    moves

    active file page -> inactive file LRU
    active anon page -> inactive anon LRU

    Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
    file LRU's head, because MADV_COLD has slightly different semantics.
    MADV_FREE means it's okay to discard the page under memory pressure because
    the content of the page is *garbage*, so freeing such pages is almost zero
    overhead: we don't need to swap them out, and a later access causes just a
    minor fault. Thus, it makes sense to put those freeable pages on the
    inactive file LRU to compete with other used-once pages. It also makes
    sense from an implementation point of view, because such a page is no
    longer swap-backed until it is re-dirtied; as a bonus, it can even be
    reclaimed on a swapless system. However, MADV_COLD doesn't mean garbage, so
    reclaiming those pages requires swap-out/in in the end, which is a bigger
    cost. Since the VM's LRU aging is designed around a cost model, anonymous
    cold pages are better positioned on the inactive anon LRU list, not the
    file LRU. Furthermore, this helps avoid unnecessary scanning if the system
    doesn't have a swap device. Let's start with the simpler approach, without
    adding complexity at this moment. Keep in mind the caveat, though, that
    workloads with a lot of page cache are likely to effectively ignore
    MADV_COLD on anonymous memory, because we rarely age the anonymous LRU
    lists.

    * man-page material

    MADV_COLD (since Linux x.x)

    Pages in the specified regions will be treated as less-recently-accessed
    compared to pages in the system with similar access frequencies. In
    contrast to MADV_FREE, the contents of the region are preserved regardless
    of subsequent writes to pages.

    MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
    pages.

    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Johannes Weiner
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

25 Sep, 2019

2 commits

  • A later patch makes the THP deferred split shrinker memcg aware, but it
    needs the page->mem_cgroup information in the THP destructor, which is
    currently called after mem_cgroup_uncharge().

    So move mem_cgroup_uncharge() from __page_cache_release() to the compound
    page destructor, which is used by both THP and other compound pages except
    HugeTLB, and call it in __put_single_page() for order-0 pages.
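
    Roughly, the described placement ends up looking like this (a sketch of the
    mm/swap.c and compound-destructor sides, not the verbatim patch):

        /* mm/swap.c: order-0 pages uncharge right before being freed */
        static void __put_single_page(struct page *page)
        {
                __page_cache_release(page);
                mem_cgroup_uncharge(page);  /* moved out of __page_cache_release() */
                free_unref_page(page);
        }

        /* common compound-page destructor (THP and other compound pages, but
         * not HugeTLB): uncharge here, after any THP-specific destructor work
         * that still needs page->mem_cgroup has run */
        void free_compound_page(struct page *page)
        {
                mem_cgroup_uncharge(page);
                __free_pages_ok(page, compound_order(page));
        }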

    Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Suggested-by: "Kirill A . Shutemov"
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • This is a cleanup patch that replaces two historical uses of
    list_move_tail() with relatively recent add_page_to_lru_list_tail().

    Link: http://lkml.kernel.org/r/20190716212436.7137-1-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jason Gunthorpe
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     

15 Jul, 2019

2 commits

  • The documentation under sysctl describes the /proc/sys interface from the
    userspace point of view. So, add it to the admin-guide and remove the
    :orphan: from its index file.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Rename the /proc/sys/ documentation files to ReST, using the README file
    as a template for an index.rst and adding the other files there via TOC
    markup.

    Despite being written at different times with different styles, try to
    make them somewhat coherent, with a similar look and feel, ensuring that
    they'll look nice both as raw text files and via the HTML output produced
    by the Sphinx build system.

    At the new index.rst, add an :orphan: while it is not yet linked from the
    main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

02 Jul, 2019

1 commit

  • release_pages() is an optimized version of a loop around put_page().
    Unfortunately, for devmap pages the logic in release_pages() is not
    entirely correct, because device pages can be of more types than just
    MEMORY_DEVICE_PUBLIC. There are in fact 4 types: private, public, FS DAX,
    and PCI P2PDMA. Some of these have specific needs to "put" the page while
    others do not.

    This logic to handle any special needs is contained in
    put_devmap_managed_page(). Therefore all devmap pages should be processed
    by this function where we can contain the correct logic for a page put.

    Handle all device type pages within release_pages() by calling
    put_devmap_managed_page() on all devmap pages. If
    put_devmap_managed_page() returns true the page has been put and we
    continue with the next page. A false return from put_devmap_managed_page()
    means the page did not require special processing and should fall through
    to "normal" processing.

    This was found via code inspection while determining if release_pages()
    and the new put_user_pages() could be interchangeable.[1]

    [1] https://lkml.kernel.org/r/20190523172852.GA27175@iweiny-DESK2.sc.intel.com
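
    The described handling boils down to something like this inside the
    release_pages() loop (an illustrative sketch, not the exact diff):

        for (i = 0; i < nr; i++) {
                struct page *page = pages[i];

                if (is_zone_device_page(page)) {
                        /*
                         * put_devmap_managed_page() knows the per-type rules
                         * (private, public, FS DAX, PCI P2PDMA).
                         */
                        if (put_devmap_managed_page(page))
                                continue;   /* reference dropped, done */
                        /* false: no special handling, fall through */
                }

                if (!put_page_testzero(page))
                        continue;
                /* ... normal "free the page" path ... */
        }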

    Link: https://lkml.kernel.org/r/20190605214922.17684-1-ira.weiny@intel.com
    Cc: Jérôme Glisse
    Cc: Michal Hocko
    Reviewed-by: Dan Williams
    Reviewed-by: John Hubbard
    Signed-off-by: Ira Weiny
    Signed-off-by: Jason Gunthorpe

    Ira Weiny
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • There is no function named munlock_vma_pages(). Correct it to
    munlock_vma_page().

    Link: http://lkml.kernel.org/r/20190402095609.27181-1-peng.fan@nxp.com
    Signed-off-by: Peng Fan
    Reviewed-by: Andrew Morton
    Reviewed-by: Mukesh Ojha
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peng Fan
     

06 Mar, 2019

1 commit

  • We have a common pattern for accessing the lru_lock from a page pointer:

        zone_lru_lock(page_zone(page))

    which is silly, because it unfolds to

        &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)].zone_pgdat->lru_lock

    while we can simply do

        &NODE_DATA(page_to_nid(page))->lru_lock

    Remove the zone_lru_lock() function, since it only complicates things.
    Use the 'page_pgdat(page)->lru_lock' pattern instead.
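
    In practice the conversion is a small, mechanical pattern like the
    following (illustrative, not a verbatim hunk):

        /* before */
        spin_lock_irq(zone_lru_lock(page_zone(page)));
        /* ... */
        spin_unlock_irq(zone_lru_lock(page_zone(page)));

        /* after */
        spin_lock_irq(&page_pgdat(page)->lru_lock);
        /* ... */
        spin_unlock_irq(&page_pgdat(page)->lru_lock);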

    [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
    Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

22 Feb, 2019

1 commit

  • Since for_each_cpu(cpu, mask) added by commit 2d3854a37e8b767a
    ("cpumask: introduce new API, without changing anything") did not
    evaluate the mask argument if NR_CPUS == 1 due to CONFIG_SMP=n,
    lru_add_drain_all() is hitting WARN_ON() at __flush_work() added by
    commit 4d43d395fed12463 ("workqueue: Try to catch flush_work() without
    INIT_WORK().") by unconditionally calling flush_work() [1].

    Work around this issue by using a CONFIG_SMP=n specific lru_add_drain_all()
    implementation. There is no real need to defer the work to a workqueue, as
    the draining is going to happen on the local CPU anyway. So alias
    lru_add_drain_all() to lru_add_drain(), which does all the necessary work.
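
    The !CONFIG_SMP variant therefore reduces to a trivial alias; roughly (a
    sketch of the described workaround):

        #ifdef CONFIG_SMP
        /* existing workqueue-based lru_add_drain_all() stays as is */
        #else
        void lru_add_drain_all(void)
        {
                /* all pagevecs are local on UP, so draining this CPU is enough */
                lru_add_drain();
        }
        #endif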

    [akpm@linux-foundation.org: fix various build warnings]
    [1] https://lkml.kernel.org/r/18a30387-6aa5-6123-e67c-57579ecc3f38@roeck-us.net
    Link: http://lkml.kernel.org/r/20190213124334.GH4525@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Guenter Roeck
    Debugged-by: Tetsuo Handa
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Jan, 2019

1 commit

  • Multiple filesystems open code lru_to_page(). Rectify this by moving
    the macro from mm_inline (which is specific to lru stuff) to the more
    generic mm.h header and start using the macro where appropriate.

    No functional changes.
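
    For reference, the macro itself is a one-liner (as defined around this
    time):

        /* now in include/linux/mm.h */
        #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))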

    Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
    Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Mike Rapoport
    Acked-by: Pankaj gupta
    Acked-by: "Yan, Zheng" [ceph]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     

29 Dec, 2018

1 commit

  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 so it seems better
    to remove the lock and convert the variables to atomics, preventing
    potential store-to-read tearing as a bonus.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

13 Nov, 2018

1 commit

  • lockdep_assert_held() is better suited to checking locking requirements,
    since it only checks if the current thread holds the lock regardless of
    whether someone else does. This is also a step towards possibly removing
    spin_is_locked().
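
    The conversion is mechanical; roughly (an illustrative example, not the
    exact hunk):

        /* before: only asks whether *some* context holds the lock */
        VM_BUG_ON(!spin_is_locked(&pgdat->lru_lock));

        /* after: asserts that the *current* context holds it (lockdep only) */
        lockdep_assert_held(&pgdat->lru_lock);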

    Signed-off-by: Lance Roy
    Cc: Andrew Morton
    Cc: "Kirill A. Shutemov"
    Cc: Yang Shi
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Paul E. McKenney

    Lance Roy
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

27 Oct, 2018

1 commit

  • Remove duplicated include linux/memremap.h

    Link: http://lkml.kernel.org/r/20180917131308.16420-1-yuehaibing@huawei.com
    Signed-off-by: YueHaibing
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YueHaibing
     

21 Oct, 2018

1 commit


30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.
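
    Value entries are created and unpacked with small helpers; roughly (using
    the xa_mk_value()/xa_is_value()/xa_to_value() helpers this change
    introduces):

        void *entry = xa_mk_value(1234);   /* encode an integer as a value entry */

        if (xa_is_value(entry))            /* distinguish it from a pointer entry */
                pr_info("stored value: %lu\n", xa_to_value(entry));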

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

22 May, 2018

1 commit

  • In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
    be able to rely on the fact that they will get wakeups on dev_pagemap
    page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
    generic_dax_page_free() as common indicator / infrastructure for dax
    filesytems to require. With this change there are no users of the
    MEMORY_DEVICE_HOST designation, so remove it.

    The HMM sub-system extended dev_pagemap to arrange a callback when a
    dev_pagemap managed page is freed. Since a dev_pagemap page is free /
    idle when its reference count is 1 it requires an additional branch to
    check the page-type at put_page() time. Given put_page() is a hot-path
    we do not want to incur that check if HMM is not in use, so a static
    branch is used to avoid that overhead when not necessary.

    Now, the FS_DAX implementation wants to reuse this mechanism for
    receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
    static-key into a generic mechanism that either HMM or FS_DAX code paths
    can enable.

    For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
    care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
    However, we still need to support FS_DAX in the FS_DAX_LIMITED case
    implemented by the s390/dcssblk driver.

    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michal Hocko
    Reported-by: kbuild test robot
    Reported-by: Thomas Meyer
    Reported-by: Dave Jiang
    Cc: "Jérôme Glisse"
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

06 Apr, 2018

1 commit

  • The 'cold' parameter was removed from release_pages function by commit
    c6f92f9fbe7d ("mm: remove cold parameter for release_pages").

    Update the description to match the code.

    Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

22 Feb, 2018

2 commits

  • There was a conflict between the commit e02a9f048ef7 ("mm/swap.c: make
    functions and their kernel-doc agree") and the commit f144c390f905 ("mm:
    docs: fix parameter names mismatch"), which both tried to fix the mismatch
    between the pagevec_lookup_entries() parameter names and their description.

    Since nr_entries is a better name for the parameter, fix the description
    again.

    Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Randy Dunlap
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • When a thread mlocks an address space backed either by file pages which
    are currently not present in memory or by swapped-out anon pages (not in
    swapcache), a new page is allocated and added to the local pagevec
    (lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
    On I/O completion, the thread can wake on a different CPU; the mlock
    syscall will then set the PageMlocked() bit of the page but will not be
    able to put that page on the unevictable LRU, as the page is on the
    pagevec of a different CPU. Even on drain, that page will go to an
    evictable LRU because the PageMlocked() bit is not checked on pagevec
    drain.

    The page will eventually go to the right LRU on reclaim, but the LRU stats
    will remain skewed for a long time.

    This patch puts all pages, even unevictable ones, onto the pagevecs and,
    on drain, adds them to their correct LRUs by checking their evictability.
    This resolves the issue of mlocked pages on the pagevecs of other CPUs,
    because when those pagevecs are drained, the mlocked file pages will go to
    the unevictable LRU. It also makes the race with munlock easier to
    resolve, because the pagevec drains happen under the LRU lock.

    However, there is still one place which makes a page evictable and does a
    PageLRU check on that page without the LRU lock, and it needs special
    attention: TestClearPageMlocked() and isolate_lru_page() in
    clear_page_mlock().

    #0: __pagevec_lru_add_fn          #1: clear_page_mlock

    SetPageLRU()                      if (!TestClearPageMlocked())
                                              return
    smp_mb()

    Acked-by: Vlastimil Babka
    Cc: Jérôme Glisse
    Cc: Huang Ying
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Jan Kara
    Cc: Nicholas Piggin
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

07 Feb, 2018

1 commit

  • There are several places where parameter descriptions do not match the
    actual code. Fix them.

    Link: http://lkml.kernel.org/r/1516700871-22279-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

01 Feb, 2018

2 commits

  • Fix some basic kernel-doc notation in mm/swap.c:

    - for function lru_cache_add_anon(), make its kernel-doc function name
    match its function name and change colon to hyphen following the
    function name

    - for function pagevec_lookup_entries(), change the function parameter
    name from nr_pages to nr_entries since that is more descriptive of
    what the parameter actually is and then it matches the kernel-doc
    comments also

    Fix function kernel-doc to match the change in commit 67fd707f4681:

    - drop the kernel-doc notation for @nr_pages from
    pagevec_lookup_range() and correct the function description for that
    change

    Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org
    Fixes: 67fd707f4681 ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()")
    Signed-off-by: Randy Dunlap
    Reviewed-by: Andrew Morton
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Pulling cpu hotplug locks inside mm core functions like lru_add_drain_all()
    just asks for problems, and the recent lockdep splat [1] proves this. While
    the usage in that particular case might be wrong, we should avoid the
    locking, as lru_add_drain_all() is used in many places. It seems that this
    is not all that hard to achieve, actually.

    We have done the same thing for drain_all_pages which is analogous by
    commit a459eeb7b852 ("mm, page_alloc: do not depend on cpu hotplug locks
    inside the allocator"). All we have to care about is to handle

    - the work item might be executed on a different CPU by a worker from an
    unbound pool, so it is not pinned to the CPU it is draining

    - we have to make sure that we do not race with page_alloc_cpu_dead
    calling lru_add_drain_cpu

    The first part is already handled because the worker calls lru_add_drain,
    which disables preemption when calling lru_add_drain_cpu on the local CPU
    it is draining. The latter is true because page_alloc_cpu_dead is called
    on the controlling CPU after the hotplugged CPU has vanished completely.

    [1] http://lkml.kernel.org/r/089e0825eec8955c1f055c83d476@google.com

    [add a cpu hotplug locking interaction as per tglx]
    Link: http://lkml.kernel.org/r/20171116120535.23765-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

16 Nov, 2017

8 commits

  • According to Vlastimil Babka, the drained field in pagevec is
    potentially misleading because it might be interpreted as draining this
    pagevec instead of the percpu lru pagevecs. Rename the field for
    clarity.

    Link: http://lkml.kernel.org/r/20171019093346.ylahzdpzmoriyf4v@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Most callers of free_hot_cold_page claim the pages being released
    are cache hot. The exception is the page reclaim paths where it is
    likely that enough pages will be freed in the near future that the
    per-cpu lists are going to be recycled and the cache hotness information
    is lost. As no one really cares about the hotness of pages being
    released to the allocator, just ditch the parameter.

    The APIs are renamed to indicate that it's no longer about hot/cold
    pages. It should also be less confusing as there are subtle differences
    between them. __free_pages drops a reference and frees a page when the
    refcount reaches zero. free_hot_cold_page handled pages whose refcount
    was already zero which is non-obvious from the name. free_unref_page
    should be more obvious.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
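
    Callers change along these lines (illustrative):

        /* before */
        free_hot_cold_page(page, false);
        free_hot_cold_page_list(&pages, false);

        /* after: no hot/cold hint to pass around */
        free_unref_page(page);
        free_unref_page_list(&pages);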

    [mgorman@techsingularity.net: add pages to head, not tail]
    Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
    Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • All callers of release_pages claim the pages being released are cache
    hot. As no one cares about the hotness of pages being released to the
    allocator, just ditch the parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
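
    The call sites shrink accordingly (illustrative):

        /* before: the second argument was the hot/cold hint */
        pagevec_init(&pvec, 0);

        /* after */
        pagevec_init(&pvec);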

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When a pagevec is initialised on the stack, it is generally used
    multiple times over a range of pages, looking up entries and then
    releasing them. On each pagevec_release, the per-cpu deferred LRU
    pagevecs are drained on the grounds the page being released may be on
    those queues and the pages may be cache hot. In many cases only the
    first drain is necessary as it's unlikely that the range of pages being
    walked is racing against LRU addition. Even if there is such a race,
    the impact is marginal, whereas constantly redraining the lru pagevecs
    has a cost.

    This patch ensures that pagevec is only drained once in a given
    lifecycle without increasing the cache footprint of the pagevec
    structure. Only sparsetruncate tiny is shown here, as large files have
    many exceptional entries and call pagecache_release less frequently.

    sparsetruncate (tiny)
                                    4.14.0-rc4          4.14.0-rc4
                              batchshadow-v1r1       onedrain-v1r1
    Min          Time        141.00 (  0.00%)    141.00 (  0.00%)
    1st-qrtle    Time        142.00 (  0.00%)    142.00 (  0.00%)
    2nd-qrtle    Time        142.00 (  0.00%)    142.00 (  0.00%)
    3rd-qrtle    Time        143.00 (  0.00%)    143.00 (  0.00%)
    Max-90%      Time        144.00 (  0.00%)    144.00 (  0.00%)
    Max-95%      Time        146.00 (  0.00%)    145.00 (  0.68%)
    Max-99%      Time        198.00 (  0.00%)    194.00 (  2.02%)
    Max          Time        254.00 (  0.00%)    208.00 ( 18.11%)
    Amean        Time        145.12 (  0.00%)    144.30 (  0.56%)
    Stddev       Time         12.74 (  0.00%)      9.62 ( 24.49%)
    Coeff        Time          8.78 (  0.00%)      6.67 ( 24.06%)
    Best99%Amean Time        144.29 (  0.00%)    143.82 (  0.32%)
    Best95%Amean Time        142.68 (  0.00%)    142.31 (  0.26%)
    Best90%Amean Time        142.52 (  0.00%)    142.19 (  0.24%)
    Best75%Amean Time        142.26 (  0.00%)    141.98 (  0.20%)
    Best50%Amean Time        141.90 (  0.00%)    141.71 (  0.13%)
    Best25%Amean Time        141.80 (  0.00%)    141.43 (  0.26%)

    The impact on bonnie is marginal and within the noise because a
    significant percentage of the file being truncated has been reclaimed
    and consists of shadow entries which reduce the hotness of the
    pagevec_release path.

    Link: http://lkml.kernel.org/r/20171018075952.10627-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages. Just drop the argument.

    Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently pagevec_lookup_range_tag() takes the number of pages to look up,
    but most users don't need this. Create a new function
    pagevec_lookup_range_nr_tag() that takes a maximum number of pages to look
    up for Ceph, which wants this functionality, so that we can drop the
    nr_pages argument from pagevec_lookup_range_tag().

    Link: http://lkml.kernel.org/r/20171009151359.31984-13-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "Ranged pagevec tagged lookup", v3.

    In this series I provide a ranged variant of pagevec_lookup_tag() and
    use it in places where it makes sense. This series removes some common
    code and also has the potential to speed up some operations, similarly
    to pagevec_lookup_range() (though for now I can think of only artificial
    cases where this happens).

    This patch (of 16):

    Implement a variant of find_get_pages_tag() that stops iterating at
    given index. Lots of users of this function (through pagevec_lookup())
    actually want a range lookup and all of them are currently open-coding
    this.

    Also create corresponding pagevec_lookup_range_tag() function.

    Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Cc: Bob Peterson
    Cc: Chao Yu
    Cc: David Howells
    Cc: David Sterba
    Cc: Ilya Dryomov
    Cc: Jaegeuk Kim
    Cc: Ryusuke Konishi
    Cc: Steve French
    Cc: "Theodore Ts'o"
    Cc: "Yan, Zheng"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

04 Oct, 2017

1 commit

  • MADV_FREE clears the pte dirty bit and then marks the page lazyfree
    (clears SwapBacked). There is no lock to prevent the page from being added
    to the swap cache between these two steps by page reclaim. Page reclaim
    could add the page to the swap cache and unmap the page. After page
    reclaim, the page is added back to the LRU. At that time, we probably
    start draining the per-cpu pagevec and mark the page lazyfree, so the page
    could end up in a state with SwapBacked cleared and PG_swapcache set. On
    the next refault at that virtual address, do_swap_page can find the page
    in the swap cache, but PageSwapCache() is false for it because SwapBacked
    isn't set, so do_swap_page will bail out and do nothing. The task will
    keep running into the fault handler.

    Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
    Link: http://lkml.kernel.org/r/6537ef3814398c0073630b03f176263bc81f0902.1506446061.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Reported-by: Artem Savkov
    Tested-by: Artem Savkov
    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

09 Sep, 2017

1 commit

  • Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache-coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for un-addressable device memory, but without all the corner
    cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

3 commits

  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages.

    Just drop the argument.

    Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Implement a variant of find_get_pages() that stops iterating at a given
    index. This may be a substantial performance gain if the mapping is
    sparse. See the following commit for details. Furthermore, lots of users of
    this function (through pagevec_lookup()) actually want a range lookup
    and all of them are currently open-coding this.

    Also create corresponding pagevec_lookup_range() function.
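
    A typical caller then looks something like this (a sketch using the new
    helper; the loop structure and the two-argument pagevec_init() form
    current at the time are illustrative):

        struct pagevec pvec;
        pgoff_t index = start;

        pagevec_init(&pvec, 0);
        while (pagevec_lookup_range(&pvec, mapping, &index, end)) {
                unsigned i;

                for (i = 0; i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];
                        /* every page here has page->index <= end */
                }
                pagevec_release(&pvec);
        }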

    Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and underlying find_get_pages()) update index to
    the next page where iteration should continue. Most callers want this
    and also pagevec_lookup_tag() already does this.

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

11 Jul, 2017

1 commit

  • The rework of the cpu hotplug locking unearthed potential deadlocks with
    the memory hotplug locking code.

    The solution for these is to rework the memory hotplug locking code as
    well and take the cpu hotplug lock before the memory hotplug lock in
    mem_hotplug_begin(), but this will cause a recursive locking of the cpu
    hotplug lock when the memory hotplug code calls lru_add_drain_all().

    Split out the inner workings of lru_add_drain_all() into
    lru_add_drain_all_cpuslocked() so this function can be invoked from the
    memory hotplug code with the cpu hotplug lock held.

    Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de
    Signed-off-by: Thomas Gleixner
    Reported-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Vladimir Davydov
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

07 Jul, 2017

1 commit

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().
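
    The rename is a straight substitution at the call sites (illustrative):

        /* before */
        mem_cgroup_count_vm_event(mm, PGMAJFAULT);

        /* after */
        count_memcg_event_mm(mm, PGMAJFAULT);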

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

04 May, 2017

1 commit

  • madvise()'s MADV_FREE indicates that pages are 'lazyfree'. They are still
    anonymous pages, but they can be freed without pageout. To distinguish
    them from normal anonymous pages, we clear their SwapBacked flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much like
    used-once file pages. For such pages, we'd like to reclaim them once there
    is memory pressure. It might also be unfair to always reclaim MADV_FREE
    pages before used-once file pages, and we definitely want to reclaim these
    pages before other anonymous and file pages.

    To speed up MADV_FREE page reclaim, we put the pages into the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE list
    is tiny nowadays and should be full of used-once file pages, so reclaiming
    MADV_FREE pages will not interfere much with anonymous and active file
    pages. And since the inactive file pages and MADV_FREE pages will be
    reclaimed according to their age, we don't reclaim too many MADV_FREE
    pages either. Putting the MADV_FREE pages into the LRU_INACTIVE_FILE list
    also means we can reclaim the pages without swap support. This idea was
    suggested by Johannes.

    This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list yet,
    to avoid bisect failures; the next patch will do it.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li