24 Mar, 2019

1 commit

  • [ Upstream commit 6ea183d60c469560e7b08a83c9804299e84ec9eb ]

    Since for_each_cpu(cpu, mask) added by commit 2d3854a37e8b767a
    ("cpumask: introduce new API, without changing anything") did not
    evaluate the mask argument if NR_CPUS == 1 due to CONFIG_SMP=n,
    lru_add_drain_all() is hitting WARN_ON() at __flush_work() added by
    commit 4d43d395fed12463 ("workqueue: Try to catch flush_work() without
    INIT_WORK().") by unconditionally calling flush_work() [1].

    Work around this issue by using a CONFIG_SMP=n specific
    lru_add_drain_all() implementation. There is no real need to defer the
    work to the workqueue as the draining is going to happen on the local
    cpu, so alias lru_add_drain_all() to lru_add_drain(), which does all
    the necessary work.
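
    A minimal sketch of the !SMP arrangement described above (illustrative,
    not the exact upstream hunk):

    #ifdef CONFIG_SMP
    void lru_add_drain_all(void);
    #else
    static inline void lru_add_drain_all(void)
    {
        /* UP: everything lives on the single local CPU, drain it directly */
        lru_add_drain();
    }
    #endif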

    [akpm@linux-foundation.org: fix various build warnings]
    [1] https://lkml.kernel.org/r/18a30387-6aa5-6123-e67c-57579ecc3f38@roeck-us.net
    Link: http://lkml.kernel.org/r/20190213124334.GH4525@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Guenter Roeck
    Debugged-by: Tetsuo Handa
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     

22 May, 2018

1 commit

  • In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
    be able to rely on the fact that they will get wakeups on dev_pagemap
    page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
    generic_dax_page_free() as common indicator / infrastructure for dax
    filesystems to require. With this change there are no users of the
    MEMORY_DEVICE_HOST designation, so remove it.

    The HMM sub-system extended dev_pagemap to arrange a callback when a
    dev_pagemap managed page is freed. Since a dev_pagemap page is free /
    idle when its reference count is 1 it requires an additional branch to
    check the page-type at put_page() time. Given put_page() is a hot-path
    we do not want to incur that check if HMM is not in use, so a static
    branch is used to avoid that overhead when not necessary.

    Now, the FS_DAX implementation wants to reuse this mechanism for
    receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
    static-key into a generic mechanism that either HMM or FS_DAX code paths
    can enable.
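
    A rough sketch of the static-branch gating described above (simplified
    and condensed; the upstream helper also checks the pgmap type):

    DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

    static inline bool put_devmap_managed_page(struct page *page)
    {
        /* costs a single patched-out branch while neither HMM nor
         * FS_DAX has enabled the key */
        if (!static_branch_unlikely(&devmap_managed_key))
            return false;
        if (!is_zone_device_page(page))
            return false;
        __put_devmap_managed_page(page);    /* may invoke ->page_free() */
        return true;
    }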

    For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
    care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
    However, we still need to support FS_DAX in the FS_DAX_LIMITED case
    implemented by the s390/dcssblk driver.

    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michal Hocko
    Reported-by: kbuild test robot
    Reported-by: Thomas Meyer
    Reported-by: Dave Jiang
    Cc: "Jérôme Glisse"
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

06 Apr, 2018

1 commit

  • The 'cold' parameter was removed from release_pages function by commit
    c6f92f9fbe7d ("mm: remove cold parameter for release_pages").

    Update the description to match the code.

    Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

22 Feb, 2018

2 commits

  • There was a conflict between the commit e02a9f048ef7 ("mm/swap.c: make
    functions and their kernel-doc agree") and the commit f144c390f905 ("mm:
    docs: fix parameter names mismatch") that both tried to fix the
    mismatch between pagevec_lookup_entries() parameter names and their
    descriptions.

    Since nr_entries is a better name for the parameter, fix the description
    again.

    Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Randy Dunlap
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • When a thread mlocks an address space backed either by file pages which
    are currently not present in memory or swapped out anon pages (not in
    swapcache), a new page is allocated and added to the local pagevec
    (lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
    On I/O completion, the thread can wake on a different CPU; the mlock
    syscall will then set the PageMlocked() bit of the page but will not be
    able to put that page on the unevictable LRU as the page sits on the
    pagevec of a different CPU. Even on drain, that page will go to the
    evictable LRU because the PageMlocked() bit is not checked on pagevec
    drain.

    The page will eventually go to the right LRU on reclaim, but the LRU
    stats will remain skewed for a long time.

    This patch puts all the pages, even the unevictable ones, onto the
    pagevecs; on drain, the pages are added to the correct LRUs by checking
    their evictability. This resolves the issue of mlocked pages sitting on
    the pagevec of another CPU, because when those pagevecs are drained the
    mlocked file pages will go to the unevictable LRU. This also makes the
    race with munlock easier to resolve because the pagevec drains happen
    under the LRU lock.
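
    A sketch of the drain-time evictability check described above (close
    to, but not exactly, the upstream __pagevec_lru_add_fn()):

    static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
                                     void *arg)
    {
        enum lru_list lru;

        SetPageLRU(page);
        smp_mb();                       /* order SetPageLRU vs. mlock bits */

        if (page_evictable(page)) {
            lru = page_lru(page);
        } else {
            lru = LRU_UNEVICTABLE;
            ClearPageActive(page);
            SetPageUnevictable(page);
        }
        add_page_to_lru_list(page, lruvec, lru);
    }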

    However there is still one place which makes a page evictable and does
    PageLRU check on that page without LRU lock and needs special attention.
    TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().

    #0: __pagevec_lru_add_fn          #1: clear_page_mlock

    SetPageLRU()                      if (!TestClearPageMlocked())
                                          return
    smp_mb() //
    Acked-by: Vlastimil Babka
    Cc: Jérôme Glisse
    Cc: Huang Ying
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Jan Kara
    Cc: Nicholas Piggin
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

07 Feb, 2018

1 commit

    There are several places where parameter descriptions do not match the
    actual code. Fix it.

    Link: http://lkml.kernel.org/r/1516700871-22279-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

01 Feb, 2018

2 commits

  • Fix some basic kernel-doc notation in mm/swap.c:

    - for function lru_cache_add_anon(), make its kernel-doc function name
    match its function name and change colon to hyphen following the
    function name

    - for function pagevec_lookup_entries(), change the function parameter
    name from nr_pages to nr_entries since that is more descriptive of
    what the parameter actually is and then it matches the kernel-doc
    comments also

    Fix function kernel-doc to match the change in commit 67fd707f4681:

    - drop the kernel-doc notation for @nr_pages from
    pagevec_lookup_range() and correct the function description for that
    change

    Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org
    Fixes: 67fd707f4681 ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()")
    Signed-off-by: Randy Dunlap
    Reviewed-by: Andrew Morton
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    Pulling cpu hotplug locks inside an mm core function like
    lru_add_drain_all() just asks for problems and the recent lockdep splat
    [1] proves this. While the usage in that particular case might be
    wrong we should avoid the locking as lru_add_drain_all() is used in many
    places. It seems that this is not all that hard to achieve actually.

    We have done the same thing for drain_all_pages which is analogous by
    commit a459eeb7b852 ("mm, page_alloc: do not depend on cpu hotplug locks
    inside the allocator"). All we have to care about is to handle

    - the work item might be executed on a different cpu by a worker from
    an unbound pool, so it is not pinned to the cpu it is draining

    - we have to make sure that we do not race with page_alloc_cpu_dead
    calling lru_add_drain_cpu

    The first part is already handled because the worker calls lru_add_drain,
    which disables preemption when calling lru_add_drain_cpu on the local
    cpu it is draining. The latter is true because page_alloc_cpu_dead is
    called on the controlling CPU after the hotplugged CPU vanished
    completely.
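
    A sketch of the first point (preemption-disabled draining); these two
    helpers mirror the ones in mm/swap.c:

    static void lru_add_drain_per_cpu(struct work_struct *dummy)
    {
        lru_add_drain();
    }

    void lru_add_drain(void)
    {
        lru_add_drain_cpu(get_cpu());   /* get_cpu() disables preemption */
        put_cpu();
    }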

    [1] http://lkml.kernel.org/r/089e0825eec8955c1f055c83d476@google.com

    [add a cpu hotplug locking interaction as per tglx]
    Link: http://lkml.kernel.org/r/20171116120535.23765-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

16 Nov, 2017

8 commits

  • According to Vlastimil Babka, the drained field in pagevec is
    potentially misleading because it might be interpreted as draining this
    pagevec instead of the percpu lru pagevecs. Rename the field for
    clarity.

    Link: http://lkml.kernel.org/r/20171019093346.ylahzdpzmoriyf4v@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Most callers of free_hot_cold_page claim the pages being released
    are cache hot. The exception is the page reclaim paths where it is
    likely that enough pages will be freed in the near future that the
    per-cpu lists are going to be recycled and the cache hotness information
    is lost. As no one really cares about the hotness of pages being
    released to the allocator, just ditch the parameter.

    The APIs are renamed to indicate that it's no longer about hot/cold
    pages. It should also be less confusing as there are subtle differences
    between them. __free_pages drops a reference and frees a page when the
    refcount reaches zero. free_hot_cold_page handled pages whose refcount
    was already zero which is non-obvious from the name. free_unref_page
    should be more obvious.
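
    For illustration, a sketch of the distinction drawn above (simplified
    from the post-rename page allocator):

    void __free_pages(struct page *page, unsigned int order)
    {
        /* drops a reference; frees only when the refcount hits zero */
        if (put_page_testzero(page)) {
            if (order == 0)
                free_unref_page(page);      /* expects refcount already zero */
            else
                __free_pages_ok(page, order);
        }
    }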

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    [mgorman@techsingularity.net: add pages to head, not tail]
    Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
    Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • All callers of release_pages claim the pages being released are cache
    hot. As no one cares about the hotness of pages being released to the
    allocator, just ditch the parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When a pagevec is initialised on the stack, it is generally used
    multiple times over a range of pages, looking up entries and then
    releasing them. On each pagevec_release, the per-cpu deferred LRU
    pagevecs are drained on the grounds the page being released may be on
    those queues and the pages may be cache hot. In many cases only the
    first drain is necessary as it's unlikely that the range of pages being
    walked is racing against LRU addition. Even if there is such a race,
    the impact is marginal whereas constantly redraining the lru pagevecs
    has a cost.

    This patch ensures that pagevec is only drained once in a given
    lifecycle without increasing the cache footprint of the pagevec
    structure. Only sparsetruncate tiny is shown here as large files have
    many exceptional entries and call pagecache_release less frequently.

    sparsetruncate (tiny)
                                 4.14.0-rc4         4.14.0-rc4
                           batchshadow-v1r1      onedrain-v1r1
    Min          Time      141.00 (  0.00%)   141.00 (  0.00%)
    1st-qrtle    Time      142.00 (  0.00%)   142.00 (  0.00%)
    2nd-qrtle    Time      142.00 (  0.00%)   142.00 (  0.00%)
    3rd-qrtle    Time      143.00 (  0.00%)   143.00 (  0.00%)
    Max-90%      Time      144.00 (  0.00%)   144.00 (  0.00%)
    Max-95%      Time      146.00 (  0.00%)   145.00 (  0.68%)
    Max-99%      Time      198.00 (  0.00%)   194.00 (  2.02%)
    Max          Time      254.00 (  0.00%)   208.00 ( 18.11%)
    Amean        Time      145.12 (  0.00%)   144.30 (  0.56%)
    Stddev       Time       12.74 (  0.00%)     9.62 ( 24.49%)
    Coeff        Time        8.78 (  0.00%)     6.67 ( 24.06%)
    Best99%Amean Time      144.29 (  0.00%)   143.82 (  0.32%)
    Best95%Amean Time      142.68 (  0.00%)   142.31 (  0.26%)
    Best90%Amean Time      142.52 (  0.00%)   142.19 (  0.24%)
    Best75%Amean Time      142.26 (  0.00%)   141.98 (  0.20%)
    Best50%Amean Time      141.90 (  0.00%)   141.71 (  0.13%)
    Best25%Amean Time      141.80 (  0.00%)   141.43 (  0.26%)

    The impact on bonnie is marginal and within the noise because a
    significant percentage of the file being truncated has been reclaimed
    and consists of shadow entries which reduce the hotness of the
    pagevec_release path.
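
    A sketch of the once-per-lifecycle drain described above, using the
    field name the flag ended up with later in this series
    (percpu_pvec_drained):

    void __pagevec_release(struct pagevec *pvec)
    {
        if (!pvec->percpu_pvec_drained) {
            lru_add_drain();                /* only on the first release */
            pvec->percpu_pvec_drained = true;
        }
        release_pages(pvec->pages, pagevec_count(pvec));
        pagevec_reinit(pvec);
    }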

    Link: http://lkml.kernel.org/r/20171018075952.10627-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages. Just drop the argument.

    Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    Currently pagevec_lookup_range_tag() takes the number of pages to look
    up, but most users don't need this. Create a new function,
    pagevec_lookup_range_nr_tag(), that takes the maximum number of pages
    to look up, for Ceph, which wants this functionality, so that we can
    drop the nr_pages argument from pagevec_lookup_range_tag().

    Link: http://lkml.kernel.org/r/20171009151359.31984-13-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "Ranged pagevec tagged lookup", v3.

    In this series I provide a ranged variant of pagevec_lookup_tag() and
    use it in places where it makes sense. This series removes some common
    code and it also has the potential to speed up some operations,
    similarly to pagevec_lookup_range() (but for now I can think of only
    artificial cases where this happens).

    This patch (of 16):

    Implement a variant of find_get_pages_tag() that stops iterating at
    given index. Lots of users of this function (through pagevec_lookup())
    actually want a range lookup and all of them are currently open-coding
    this.

    Also create corresponding pagevec_lookup_range_tag() function.
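
    A sketch of how a caller might use the ranged tag lookup (hypothetical
    caller; mapping, start and end are assumed, and the final signature
    after the nr_pages argument was dropped by the entry above is used):

    struct pagevec pvec;
    pgoff_t index = start;

    pagevec_init(&pvec, 0);     /* second argument is the 'cold' hint of that era */
    while (pagevec_lookup_range_tag(&pvec, mapping, &index, end,
                                    PAGECACHE_TAG_DIRTY)) {
        /* write back pvec.pages[0 .. pagevec_count(&pvec) - 1] */
        pagevec_release(&pvec);
    }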

    Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Cc: Bob Peterson
    Cc: Chao Yu
    Cc: David Howells
    Cc: David Sterba
    Cc: Ilya Dryomov
    Cc: Jaegeuk Kim
    Cc: Ryusuke Konishi
    Cc: Steve French
    Cc: "Theodore Ts'o"
    Cc: "Yan, Zheng"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

04 Oct, 2017

1 commit

  • MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear
    SwapBacked). There is no lock to prevent the page from being added to
    the swap cache between these two steps by page reclaim. Page reclaim
    could add
    the page to swap cache and unmap the page. After page reclaim, the page
    is added back to lru. At that time, we probably start draining per-cpu
    pagevec and mark the page lazyfree. So the page could be in a state
    with SwapBacked cleared and PG_swapcache set. Next time there is a
    refault in the virtual address, do_swap_page can find the page from swap
    cache but the page has PageSwapCache false because SwapBacked isn't set,
    so do_swap_page will bail out and do nothing. The task will keep
    running into fault handler.
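
    The direction of the fix, sketched against the mm/swap.c lazyfree path
    (the key addition is the !PageSwapCache() check; treat this as a
    simplified sketch, not the exact upstream hunk):

    void mark_page_lazyfree(struct page *page)
    {
        if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
            !PageSwapCache(page) && !PageUnevictable(page)) {
            struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);

            get_page(page);
            if (!pagevec_add(pvec, page) || PageCompound(page))
                pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
            put_cpu_var(lru_lazyfree_pvecs);
        }
    }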

    Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
    Link: http://lkml.kernel.org/r/6537ef3814398c0073630b03f176263bc81f0902.1506446061.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Reported-by: Artem Savkov
    Tested-by: Artem Savkov
    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

09 Sep, 2017

1 commit

    Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for un-addressable device memory but without all the corner
    cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

3 commits

  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages.

    Just drop the argument.

    Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Implement a variant of find_get_pages() that stops iterating at given
    index. This may be a substantial performance gain if the mapping is
    sparse. See following commit for details. Furthermore lots of users of
    this function (through pagevec_lookup()) actually want a range lookup
    and all of them are currently open-coding this.

    Also create corresponding pagevec_lookup_range() function.
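
    A sketch of the intended usage (hypothetical caller; mapping, start and
    end assumed, shown with the final signature after the nr_pages argument
    was dropped by the entry above):

    struct pagevec pvec;
    pgoff_t index = start;

    pagevec_init(&pvec, 0);     /* second argument is the 'cold' hint of that era */
    while (pagevec_lookup_range(&pvec, mapping, &index, end)) {
        /* only pages with page->index in [start, end] are returned */
        pagevec_release(&pvec);
    }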

    Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and underlying find_get_pages()) update index to
    the next page where iteration should continue. Most callers want this
    and also pagevec_lookup_tag() already does this.

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

11 Jul, 2017

1 commit

  • The rework of the cpu hotplug locking unearthed potential deadlocks with
    the memory hotplug locking code.

    The solution for these is to rework the memory hotplug locking code as
    well and take the cpu hotplug lock before the memory hotplug lock in
    mem_hotplug_begin(), but this will cause a recursive locking of the cpu
    hotplug lock when the memory hotplug code calls lru_add_drain_all().

    Split out the inner workings of lru_add_drain_all() into
    lru_add_drain_all_cpuslocked() so this function can be invoked from the
    memory hotplug code with the cpu hotplug lock held.
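
    A sketch of the resulting split (the body of the old function moves
    into the _cpuslocked variant; __lru_add_drain_all() here is just a
    placeholder name for that body):

    void lru_add_drain_all_cpuslocked(void)
    {
        /* caller already holds the cpu hotplug lock */
        __lru_add_drain_all();          /* placeholder for the real body */
    }

    void lru_add_drain_all(void)
    {
        get_online_cpus();
        lru_add_drain_all_cpuslocked();
        put_online_cpus();
    }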

    Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de
    Signed-off-by: Thomas Gleixner
    Reported-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Vladimir Davydov
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

07 Jul, 2017

1 commit

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

04 May, 2017

1 commit

    madvise()'s MADV_FREE indicates pages are 'lazyfree'. They are still
    anonymous pages, but they can be freed without pageout. To distinguish
    them from normal anonymous pages, we clear their SwapBacked flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them
    once there is memory pressure. It might also be unfair to always
    reclaim MADV_FREE pages before used-once file pages, yet we definitely
    want to reclaim them before other anonymous and file pages.

    To speed up MADV_FREE page reclaim, we put the pages onto the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
    list is tiny nowadays and should be full of used-once file pages.
    Reclaiming MADV_FREE pages will not interfere much with anonymous and
    active file pages. And the inactive file pages and MADV_FREE pages will
    be reclaimed according to their age, so we don't reclaim too many
    MADV_FREE pages either. Putting the MADV_FREE pages onto the
    LRU_INACTIVE_FILE list also means we can reclaim the pages without swap
    support. This idea was suggested by Johannes.
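
    The placement hinges on how LRU membership is derived from the page
    flags; for reference, the helper below (from include/linux/mm_inline.h
    of that era) is why clearing SwapBacked routes a page to the file LRU:

    static inline int page_is_file_cache(struct page *page)
    {
        /* MADV_FREE (lazyfree) anon pages have SwapBacked cleared,
         * so they count as "file" for LRU placement */
        return !PageSwapBacked(page);
    }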

    This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
    avoid bisect failure, next patch will do it.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

02 May, 2017

1 commit

  • Pull x86 mm updates from Ingo Molnar:
    "The main x86 MM changes in this cycle were:

    - continued native kernel PCID support preparation patches to the TLB
    flushing code (Andy Lutomirski)

    - various fixes related to 32-bit compat syscall returning address
    over 4Gb in applications, launched from 64-bit binaries - motivated
    by C/R frameworks such as Virtuozzo. (Dmitry Safonov)

    - continued Intel 5-level paging enablement: in particular the
    conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)

    - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)

    - ... plus misc updates, fixes and cleanups"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
    mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
    x86/mm: Fix flush_tlb_page() on Xen
    x86/mm: Make flush_tlb_mm_range() more predictable
    x86/mm: Remove flush_tlb() and flush_tlb_current_task()
    x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
    x86/mm/64: Fix crash in remove_pagetable()
    Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
    x86/boot/e820: Remove a redundant self assignment
    x86/mm: Fix dump pagetables for 4 levels of page tables
    x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
    x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
    Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
    x86/espfix: Add support for 5-level paging
    x86/kasan: Extend KASAN to support 5-level paging
    x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
    x86/paravirt: Add 5-level support to the paravirt code
    x86/mm: Define virtual memory map for 5-level paging
    x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
    x86/boot: Detect 5-level paging support
    x86/mm/numa: Remove numa_nodemask_from_meminfo()
    ...

    Linus Torvalds
     

01 May, 2017

1 commit

  • The x86 conversion to the generic GUP code included a small change which causes
    crashes and data corruption in the pmem code - not good.

    The root cause is that the /dev/pmem driver code implicitly relies on the x86
    get_user_pages() implementation doing a get_page() on the page refcount, because
    get_page() does a get_zone_device_page() which properly refcounts pmem's separate
    page struct arrays that are not present in the regular page struct structures.
    (The pmem driver does this because it can cover huge memory areas.)

    But the x86 conversion to the generic GUP code changed the get_page() to
    page_cache_get_speculative() which is faster but doesn't do the
    get_zone_device_page() call the pmem code relies on.

    One way to solve the regression would be to change the generic GUP code to use
    get_page(), but that would slow things down a bit and punish other generic-GUP
    using architectures for an x86-ism they did not care about. (Arguably the pmem
    driver was probably not working reliably for them: but nvdimm is an Intel
    feature, so non-x86 exposure is probably still limited.)

    So restructure the pmem code's interface with the MM instead: get rid of the
    get/put_zone_device_page() distinction, integrate put_zone_device_page() into
    __put_page() and restructure the pmem completion-wait and teardown machinery:

    Kirill points out that the calls to {get,put}_dev_pagemap() can be
    removed from the mm fast path if we take a single get_dev_pagemap()
    reference to signify that the page is alive and use the final put of the
    page to drop that reference.

    This does require some care to make sure that any waits for the
    percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
    since it now maintains its own elevated reference.

    This speeds up things while also making the pmem refcounting more robust going
    forward.
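
    A sketch of the __put_page() integration described above (condensed
    from the upstream change):

    void __put_page(struct page *page)
    {
        if (is_zone_device_page(page)) {
            put_dev_pagemap(page->pgmap);
            /* the page belongs to the device that created pgmap;
             * do not return it to the page allocator */
            return;
        }

        if (unlikely(PageCompound(page)))
            __put_compound_page(page);
        else
            __put_single_page(page);
    }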

    Suggested-by: Kirill Shutemov
    Tested-by: Kirill Shutemov
    Signed-off-by: Dan Williams
    Reviewed-by: Logan Gunthorpe
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Jérôme Glisse
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar

    Dan Williams
     

08 Apr, 2017

1 commit

  • We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
    vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
    per cpu lru caches. This seems more than necessary because both can run
    on a single WQ. Both do not block on locks requiring a memory
    allocation nor perform any allocations themselves. We will save one
    rescuer thread this way.

    On the other hand drain_all_pages() queues work on the system wq, which
    doesn't have a rescuer and so depends on memory allocation (a problem
    when all workers are stuck allocating and new ones cannot be created).

    Initially we thought this would be more of a theoretical problem but
    Hugh Dickins has reported:

    : 4.11-rc has been giving me hangs after hours of swapping load. At
    : first they looked like memory leaks ("fork: Cannot allocate memory");
    : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
    : before looking at /proc/meminfo one time, and the stat_refresh stuck
    : in D state, waiting for completion of flush_work like many kworkers.
    : kthreadd waiting for completion of flush_work in drain_all_pages().

    This worker should be using WQ_RECLAIM as well in order to guarantee a
    forward progress. We can reuse the same one as for lru draining and
    vmstat.

    Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Tested-by: Yang Li
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Feb, 2017

1 commit

  • We noticed a performance regression when moving hadoop workloads from
    3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
    activity initiated by kswapd as well as frequent bursts of allocation
    stalls and direct reclaim scans. Even lowering the dirty ratios to the
    equivalent of less than 1% of memory would not eliminate the issue,
    suggesting that dirty pages concentrate where the scanner is looking.

    This can be traced back to recent efforts of thrash avoidance. Where
    3.10 would not detect refaulting pages and continuously supply clean
    cache to the inactive list, a thrashing workload on 4.0+ will detect and
    activate refaulting pages right away, distilling used-once pages on the
    inactive list much more effectively. This is by design, and it makes
    sense for clean cache. But for the most part our workload's cache
    faults are refaults and its use-once cache is from streaming writes. We
    end up with most of the inactive list dirty, and we don't go after the
    active cache as long as we have use-once pages around.

    But waiting for writes to avoid reclaiming clean cache that *might*
    refault is a bad trade-off. Even if the refaults happen, reads are
    faster than writes. Before getting bogged down on writeback, reclaim
    should first look at *all* cache in the system, even active cache.

    To accomplish this, activate pages that are dirty or under writeback
    when they reach the end of the inactive LRU. The pages are marked for
    immediate reclaim, meaning they'll get moved back to the inactive LRU
    tail as soon as they're written back and become reclaimable. But in the
    meantime, by reducing the inactive list to only immediately reclaimable
    pages, we allow the scanner to deactivate and refill the inactive list
    with clean cache from the active list tail to guarantee forward
    progress.
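
    A simplified sketch of the policy (defer_dirty_page() is a hypothetical
    helper standing in for the relevant shrink_page_list() branches):

    static bool defer_dirty_page(struct page *page)
    {
        if (!PageDirty(page) && !PageWriteback(page))
            return false;
        SetPageReclaim(page);   /* rotate to the inactive tail once clean */
        SetPageActive(page);    /* park it on the active list meanwhile */
        return true;
    }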

    [hannes@cmpxchg.org: update comment]
    Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Feb, 2017

1 commit

    This patch improves the scalability of swap out/in by using
    fine-grained locks for the swap cache. In current kernels, one address
    space will be used for each swap device. And in the common
    configuration, the number of the swap device is very small (one is
    typical). This causes the heavy lock contention on the radix tree of
    the address space if multiple tasks swap out/in concurrently.

    But in fact, there is no dependency between pages in the swap cache, so
    we can split the one shared address space for each swap device
    into several address spaces to reduce the lock contention. In the
    patch, the shared address space is split into 64MB trunks. 64MB is
    chosen to balance the memory space usage and effect of lock contention
    reduction.
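
    A sketch of the lookup this implies (matching the upstream
    SWAP_ADDRESS_SPACE_SHIFT scheme, shown as a function rather than the
    actual macro):

    #define SWAP_ADDRESS_SPACE_SHIFT    14  /* 2^14 slots * 4KB = 64MB */
    #define SWAP_ADDRESS_SPACE_PAGES    (1 << SWAP_ADDRESS_SPACE_SHIFT)

    static struct address_space *swap_address_space(swp_entry_t entry)
    {
        return &swapper_spaces[swp_type(entry)]
                              [swp_offset(entry) >> SWAP_ADDRESS_SPACE_SHIFT];
    }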

    The size of struct address_space on x86_64 architecture is 408B, so with
    the patch, 6528B more memory will be used for every 1GB swap space on
    x86_64 architecture.

    One address space is still shared for the swap entries in the same 64M
    trunks. To avoid lock contention for the first round of swap space
    allocation, the order of the swap clusters in the initial free clusters
    list is changed. The swap space distance between the consecutive swap
    clusters in the free cluster list is at least 64M. After the first
    round of allocation, the swap clusters are expected to be freed
    randomly, so the lock contention should be reduced effectively.

    Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Tim Chen
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang, Ying
     

26 Dec, 2016

1 commit

  • Add a new page flag, PageWaiters, to indicate the page waitqueue has
    tasks waiting. This can be tested rather than testing waitqueue_active
    which requires another cacheline load.

    This bit is always set when the page has tasks on page_waitqueue(page),
    and is set and cleared under the waitqueue lock. It may be set when
    there are no tasks on the waitqueue, which will cause a harmless extra
    wakeup check that clears the bit.

    The generic bit-waitqueue infrastructure is no longer used for pages.
    Instead, waitqueues are used directly with a custom key type. The
    generic code was not flexible enough to have PageWaiters manipulation
    under the waitqueue lock (which simplifies concurrency).

    This improves the performance of page lock intensive microbenchmarks by
    2-3%.

    Putting two bits in the same word opens the opportunity to remove the
    memory barrier between clearing the lock bit and testing the waiters
    bit, after some work on the arch primitives (e.g., ensuring memory
    operand widths match and cover both bits).
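
    Roughly, the unlock fast path this enables (a sketch, not the exact
    upstream function):

    void unlock_page(struct page *page)
    {
        page = compound_head(page);
        clear_bit_unlock(PG_locked, &page->flags);
        smp_mb__after_atomic();         /* the barrier discussed above */
        if (PageWaiters(page))          /* same flags word as PG_locked */
            wake_up_page_bit(page, PG_locked);
    }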

    Signed-off-by: Nicholas Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

08 Oct, 2016

1 commit

  • The global zero page is used to satisfy an anonymous read fault. If
    THP(Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to touch
    the global counter in two cases:

    1 The first time it uses the global huge zero page;
    2 The time when mm_user of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_user, the kthread is not eligible to use huge
    zero page either. Since no kthread is using huge zero page today, there
    is no difference after applying this patch. But if that is not desired,
    I can change it to when mm_count reaches zero.
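
    A sketch of the per-mm gating (function and flag names follow the
    description above; treat it as illustrative):

    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
        if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
            return READ_ONCE(huge_zero_page);       /* fast path, no atomic */

        if (!get_huge_zero_page())
            return NULL;

        /* lost a race for the first use in this mm: drop the extra ref */
        if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
            put_huge_zero_page();

        return READ_ONCE(huge_zero_page);
    }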

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

29 Jul, 2016

3 commits

  • With node-lru, the locking is based on the pgdat. Previously it was
    required that a pagevec drain released one zone lru_lock and acquired
    another zone lru_lock on every zone change. Now, it's only necessary if
    the node changes. The end-result is fewer lock release/acquires if the
    pages are all on the same node but in different zones.
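
    A sketch of the lock handling this describes (lru_move_per_node() is an
    illustrative name, condensed from the node-based pagevec_lru_move_fn()):

    static void lru_move_per_node(struct pagevec *pvec)
    {
        struct pglist_data *pgdat = NULL;
        unsigned long flags = 0;
        int i;

        for (i = 0; i < pagevec_count(pvec); i++) {
            struct page *page = pvec->pages[i];
            struct pglist_data *pagepgdat = page_pgdat(page);

            /* switch locks only when the node changes, not the zone */
            if (pagepgdat != pgdat) {
                if (pgdat)
                    spin_unlock_irqrestore(&pgdat->lru_lock, flags);
                pgdat = pagepgdat;
                spin_lock_irqsave(&pgdat->lru_lock, flags);
            }
            /* ... move/operate on the page under pgdat->lru_lock ... */
        }
        if (pgdat)
            spin_unlock_irqrestore(&pgdat->lru_lock, flags);
    }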

    Link: http://lkml.kernel.org/r/1468588165-12461-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate with the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

1 commit

    Here's a basic implementation of huge page support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can and tries to insert
    it into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix tree if
    there's space for it;

    - shmem_undo_range() removes huge pages if they are fully within the
    range. Partial truncate of huge pages zeroes out that part of the THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending on what
    pages happened to be allocated.

    - no need to change shmem_fault: core-mm will map a compound page as
    huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

25 Jun, 2016

1 commit

    Currently we can have compound pages held on per-cpu pagevecs, which
    leads to a lot of memory being unavailable for reclaim when needed. In
    systems with hundreds of processors it can be GBs of memory.

    One way of reproducing the problem is to not call munmap explicitly on
    all mapped regions (i.e. after receiving SIGTERM). After that, some
    pages (with THP enabled, also huge pages) may end up on lru_add_pvec;
    example below.

    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
    #pragma omp parallel
        {
            size_t size = 55 * 1000 * 1000; // smaller than MEM/CPUS
            void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p != MAP_FAILED)
                memset(p, 0, size);
            //munmap(p, size); // uncomment to make the problem go away
        }
        return 0;
    }

    When we run it with THP enabled it will leave a significant amount of
    memory on lru_add_pvec. This memory will not be reclaimed if we hit
    OOM, so when we run the above program in a loop:

    for i in `seq 100`; do ./a.out; done

    many processes (95% in my case) will be killed by OOM.

    The primary point of the LRU add cache is to save the zone lru_lock
    contention with a hope that more pages will belong to the same zone and
    so their addition can be batched. The huge page is already a form of
    batched addition (it will add 512 worth of memory in one go) so skipping
    the batching seems like a safer option when compared to a potential
    excess in the caching which can be quite large and much harder to fix
    because lru_add_drain_all is way too expensive and it is not really clear
    what would be a good moment to call it.

    Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
    madvise(p, size, MADV_FREE); after memset.

    This patch flushes the lru pvecs on compound page arrival, making the
    problem less severe - after applying it, the kill rate of the above
    example drops to 0%, due to reducing the maximum amount of memory held
    on a pvec from 28MB (with THP) to 56kB per CPU.
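
    A sketch of the resulting add path (close to the upstream
    __lru_cache_add() after the fix):

    static void __lru_cache_add(struct page *page)
    {
        struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

        get_page(page);
        /* drain right away if the pagevec is full or the page is compound */
        if (!pagevec_add(pvec, page) || PageCompound(page))
            __pagevec_lru_add(pvec);
        put_cpu_var(lru_add_pvec);
    }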

    Suggested-by: Michal Hocko
    Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com
    Signed-off-by: Lukasz Odzioba
    Acked-by: Michal Hocko
    Cc: Kirill Shutemov
    Cc: Andrea Arcangeli
    Cc: Vladimir Davydov
    Cc: Ming Li
    Cc: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lukasz Odzioba
     

10 Jun, 2016

1 commit

  • This patch is based on https://patchwork.ozlabs.org/patch/574623/.

    Tejun submitted commit 23d11a58a9a6 ("workqueue: skip flush dependency
    checks for legacy workqueues") for the legacy create*_workqueue()
    interface.

    But some workqueues created by alloc_workqueue still report warnings on
    memory reclaim, e.g. nvme_workq with the WQ_MEM_RECLAIM flag set:

    workqueue: WQ_MEM_RECLAIM nvme:nvme_reset_work is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 6 at SoC/linux/kernel/workqueue.c:2448 check_flush_dependency+0xb4/0x10c
    ...
    check_flush_dependency+0xb4/0x10c
    flush_work+0x54/0x140
    lru_add_drain_all+0x138/0x188
    migrate_prep+0xc/0x18
    alloc_contig_range+0xf4/0x350
    cma_alloc+0xec/0x1e4
    dma_alloc_from_contiguous+0x38/0x40
    __dma_alloc+0x74/0x25c
    nvme_alloc_queue+0xcc/0x36c
    nvme_reset_work+0x5c4/0xda8
    process_one_work+0x128/0x2ec
    worker_thread+0x58/0x434
    kthread+0xd4/0xe8
    ret_from_fork+0x10/0x50

    That's because lru_add_drain_all() will schedule the drain work on
    system_wq, whose flag is set to 0, !WQ_MEM_RECLAIM.

    Introduce a dedicated WQ_MEM_RECLAIM workqueue to do
    lru_add_drain_all(), aiding in getting memory freed.
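
    A sketch of the dedicated workqueue setup described above (this mirrors
    the upstream hunk, but treat it as illustrative):

    static struct workqueue_struct *lru_add_drain_wq;

    static int __init lru_init(void)
    {
        lru_add_drain_wq = alloc_workqueue("lru-add-drain", WQ_MEM_RECLAIM, 0);
        if (WARN_ON(!lru_add_drain_wq))
            return -ENOMEM;
        return 0;
    }
    early_initcall(lru_init);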

    Link: http://lkml.kernel.org/r/1464917521-9775-1-git-send-email-shhuiw@foxmail.com
    Signed-off-by: Wang Sheng-Hui
    Acked-by: Tejun Heo
    Cc: Keith Busch
    Cc: Peter Zijlstra
    Cc: Thierry Reding
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     

21 May, 2016

1 commit


29 Apr, 2016

1 commit

  • Andrea has found[1] a race condition on MMU-gather based TLB flush vs
    split_huge_page() or the shrinker which frees the huge zero page under
    us (patch 1/2
    and 2/2 respectively).

    With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the
    page pinned until flush is complete and the pin prevents the page from
    being split under us.

    We still need patch 2/2. This is a simplified version of Andrea's
    patch. We don't need fancy encoding.

    [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Apr, 2016

1 commit