17 Oct, 2020

1 commit

  • Fix some broken comments, including a typo, a grammar error, and a wrong
    function name.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200913095456.54873-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

14 Oct, 2020

3 commits

  • SWP_FS is used to make swap_{read,write}page() go through the filesystem,
    and for now it's only used for swap files over NFS. Otherwise IO is
    submitted directly to the block device according to the swapfile extents
    reported by the filesystem in advance.

    As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's
    rename to SWP_FS_OPS.

    [1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org
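
    A rough sketch of the dispatch this flag controls in swap_readpage()
    (simplified; the helper and field names below are recalled from the
    surrounding code, not quoted from this patch):

        if (sis->flags & SWP_FS_OPS) {
                /* swap file over NFS: hand the IO to the filesystem */
                ret = mapping->a_ops->readpage(sis->swap_file, page);
        } else {
                /* regular case: submit a bio straight to the block device,
                 * using the extents the filesystem reported at swapon time */
                ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
        }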

    Suggested-by: Matthew Wilcox
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     
  • There are only four callers remaining of find_get_entry().
    get_shadow_from_swap_cache() only wants to see shadow entries and doesn't
    care about which page is returned. Push the find_subpage() call into
    find_lock_entry(), find_get_incore_page() and pagecache_get_page().

    [willy@infradead.org: fix oops]
    Link: https://lkml.kernel.org/r/20200914112738.GM6583@casper.infradead.org
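
    A hedged sketch of the caller-side pattern this implies (the helper names
    are taken from the text above; the exact shape is an assumption):

        /* find_get_entry() now returns the head page (or a shadow/swap
         * value entry); wrappers that need the precise subpage resolve
         * it themselves */
        page = find_get_entry(mapping, index);
        if (page && !xa_is_value(page))
                page = find_subpage(page, index);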

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Patch series "Return head pages from find_*_entry", v2.

    This patch series started out as part of the THP patch set, but it has
    some nice effects along the way and it seems worth splitting it out and
    submitting separately.

    Currently find_get_entry() and find_lock_entry() return the page
    corresponding to the requested index, but the first thing most callers do
    is find the head page, which we just threw away. As part of auditing all
    the callers, I found some misuses of the APIs and some plain
    inefficiencies that I've fixed.

    The diffstat is unflattering, but I added more kernel-doc and a new wrapper.

    This patch (of 8):

    Provide this functionality from the swap cache. It's useful for
    more than just mincore().
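
    The helper being factored out is not named in this excerpt, but
    find_get_incore_page() from the neighbouring entry fits; a hedged sketch
    of the idea (an assumption, not the actual patch):

        /* Look in the page cache first; if what is found there is a swap
         * value entry (e.g. a swapped-out shmem page), look the page up in
         * the swap cache instead. */
        page = find_get_entry(mapping, index);
        if (xa_is_value(page)) {
                swp_entry_t swp = radix_to_swp_entry(page);

                page = find_get_page(swap_address_space(swp),
                                     swp_offset(swp));
        }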

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Jani Nikula
    Cc: Alexey Dobriyan
    Cc: Johannes Weiner
    Cc: Chris Wilson
    Cc: Matthew Auld
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200910183318.20139-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20200910183318.20139-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

15 Aug, 2020

2 commits

  • swap_cache_info.* could be accessed concurrently as noticed by
    KCSAN,

    BUG: KCSAN: data-race in lookup_swap_cache / lookup_swap_cache

    write to 0xffffffff85517318 of 8 bytes by task 94138 on cpu 101:
    lookup_swap_cache+0x12e/0x460
    lookup_swap_cache at mm/swap_state.c:322
    do_swap_page+0x112/0xeb0
    __handle_mm_fault+0xc7a/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffffffff85517318 of 8 bytes by task 91655 on cpu 100:
    lookup_swap_cache+0x117/0x460
    lookup_swap_cache at mm/swap_state.c:322
    shmem_swapin_page+0xc7/0x9e0
    shmem_getpage_gfp+0x2ca/0x16c0
    shmem_fault+0xef/0x3c0
    __do_fault+0x9e/0x220
    do_fault+0x4a0/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 100 PID: 91655 Comm: systemd-journal Tainted: G W O L 5.5.0-next-20200204+ #6
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    write to 0xffffffff8d717308 of 8 bytes by task 11365 on cpu 87:
    __delete_from_swap_cache+0x681/0x8b0
    __delete_from_swap_cache at mm/swap_state.c:178

    read to 0xffffffff8d717308 of 8 bytes by task 11275 on cpu 53:
    __delete_from_swap_cache+0x66e/0x8b0
    __delete_from_swap_cache at mm/swap_state.c:178

    Both the read and write are done locklessly. Since the swap_cache_info.*
    counters are only used to print out information, it is harmless even if
    some increments are missed due to data races, so just mark the accesses
    as intentional data races using the data_race() macro.

    While at it, fix a checkpatch.pl warning,

    WARNING: Single statement macros should not use a do {} while (0) loop
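
    A minimal sketch of the kind of change described, assuming the counters
    are bumped through a helper macro:

        /* was: #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) */
        #define INC_CACHE_INFO(x)       data_race(swap_cache_info.x++)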

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200207003715.1578-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

2 commits

  • This patch implements workingset detection for anonymous LRU. All the
    infrastructure is implemented by the previous patches so this patch just
    activates the workingset detection by installing/retrieving the shadow
    entry and adding refault calculation.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Workingset detection for anonymous pages will be implemented in the
    following patch, and it requires storing shadow entries in the swapcache.
    This patch implements the infrastructure to store the shadow entry in the
    swapcache.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

1 commit

  • Fix W=1 compile warnings (invalid kerneldoc):

    mm/swap_state.c:742: warning: Function parameter or member 'fentry' not described in 'swap_vma_readahead'
    mm/swap_state.c:742: warning: Excess function parameter 'entry' description in 'swap_vma_readahead'

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200728171109.28687-2-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     

26 Jun, 2020

1 commit

  • Chris Murphy reports that a slightly overcommitted load, testing swap
    and zram along with i915, splats and keeps on splatting, when it had
    better fail less noisily:

    gnome-shell: page allocation failure: order:0,
    mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE),
    nodemask=(null),cpuset=/,mems_allowed=0
    CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1
    Call Trace:
    dump_stack+0x64/0x88
    warn_alloc.cold+0x75/0xd9
    __alloc_pages_slowpath.constprop.0+0xcfa/0xd30
    __alloc_pages_nodemask+0x2df/0x320
    alloc_slab_page+0x195/0x310
    allocate_slab+0x3c5/0x440
    ___slab_alloc+0x40c/0x5f0
    __slab_alloc+0x1c/0x30
    kmem_cache_alloc+0x20e/0x220
    xas_nomem+0x28/0x70
    add_to_swap_cache+0x321/0x400
    __read_swap_cache_async+0x105/0x240
    swap_cluster_readahead+0x22c/0x2e0
    shmem_swapin+0x8e/0xc0
    shmem_swapin_page+0x196/0x740
    shmem_getpage_gfp+0x3a2/0xa60
    shmem_read_mapping_page_gfp+0x32/0x60
    shmem_get_pages+0x155/0x5e0 [i915]
    __i915_gem_object_get_pages+0x68/0xa0 [i915]
    i915_vma_pin+0x3fe/0x6c0 [i915]
    eb_add_vma+0x10b/0x2c0 [i915]
    i915_gem_do_execbuffer+0x704/0x3430 [i915]
    i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915]
    drm_ioctl_kernel+0x86/0xd0 [drm]
    drm_ioctl+0x206/0x390 [drm]
    ksys_ioctl+0x82/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x5b/0xf0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported on 5.7, but it goes back really to 3.1: when
    shmem_read_mapping_page_gfp() was implemented for use by i915, and
    allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but
    missed swapin's "& GFP_KERNEL" mask for page tree node allocation in
    __read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits
    from what page cache uses, but GFP_RECLAIM_MASK is now what's needed.
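
    A hedged sketch of the described one-line fix in __read_swap_cache_async()
    (the argument list is recalled, not quoted):

        /* was: add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL) */
        err = add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK);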

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=208085
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils
    Fixes: 68da9f055755 ("tmpfs: pass gfp to shmem_getpage_gfp")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox (Oracle)
    Reported-by: Chris Murphy
    Analyzed-by: Vlastimil Babka
    Analyzed-by: Matthew Wilcox
    Tested-by: Chris Murphy
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
    return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
    return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
    sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

5 commits

  • The VM tries to balance reclaim pressure between anon and file so as to
    reduce the amount of IO incurred due to the memory shortage. It already
    counts refaults and swapins, but in addition it should also count
    writepage calls during reclaim.

    For swap, this is obvious: it's IO that wouldn't have occurred if the
    anonymous memory hadn't been under memory pressure. From a relative
    balancing point of view this makes sense as well: even if anon is cold and
    reclaimable, a cache that isn't thrashing may have equally cold pages that
    don't require IO to reclaim.

    For file writeback, it's trickier: some of the reclaim writepage IO would
    have likely occurred anyway due to dirty expiration. But not all of it -
    premature writeback reduces batching and generates additional writes.
    Since the flushers are already woken up by the time the VM starts writing
    cache pages one by one, let's assume that we're likely causing writes that
    wouldn't have happened without memory pressure. In addition, the per-page
    cost of IO would have probably been much cheaper if written in larger
    batches from the flusher thread rather than the single-page-writes from
    kswapd.

    For our purposes - getting the trend right to accelerate convergence on a
    stable state that doesn't require paging at all - this is sufficiently
    accurate. If we later wanted to optimize for sustained thrashing, we can
    still refine the measurements.

    Count all writepage calls from kswapd as IO cost toward the LRU that the
    page belongs to.

    Why do this dynamically? Don't we know in advance that anon pages require
    IO to reclaim, and so could build in a static bias?

    First, scanning is not the same as reclaiming. If all the anon pages are
    referenced, we may not swap for a while just because we're scanning the
    anon list. During this time, however, it's important that we age
    anonymous memory and the page cache at the same rate so that their
    hot-cold gradients are comparable. Everything else being equal, we still
    want to reclaim the coldest memory overall.

    Second, we keep copies in swap unless the page changes. If there is
    swap-backed data that's mostly read (tmpfs file) and has been swapped out
    before, we can reclaim it without incurring additional IO.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since the LRUs were split into anon and file lists, the VM has been
    balancing between page cache and anonymous pages based on per-list ratios
    of scanned vs. rotated pages. In most cases that tips page reclaim
    towards the list that is easier to reclaim and has the fewest actively
    used pages, but there are a few problems with it:

    1. Refaults and LRU rotations are weighted the same way, even though
    one costs IO and the other costs a bit of CPU.

    2. The less we scan an LRU list based on already observed rotations,
    the more we increase the sampling interval for new references, and
    rotations become even more likely on that list. This can enter a
    death spiral in which we stop looking at one list completely until
    the other one is all but annihilated by page reclaim.

    Since commit a528910e12ec ("mm: thrash detection-based file cache sizing")
    we have refault detection for the page cache. Along with swapin events,
    they are good indicators of when the file or anon list, respectively, is
    too small for its workingset and needs to grow.

    For example, if the page cache is thrashing, the cache pages need more
    time in memory, while there may be colder pages on the anonymous list.
    Likewise, if swapped pages are faulting back in, it indicates that we
    reclaim anonymous pages too aggressively and should back off.

    Replace LRU rotations with refaults and swapins as the basis for relative
    reclaim cost of the two LRUs. This will have the VM target list balances
    that incur the least amount of IO on aggregate.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • They're the same function, and for the purpose of all callers they are
    equivalent to lru_cache_add().

    [akpm@linux-foundation.org: fix it for local_lock changes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Swapin faults were the last event to charge pages after they had already
    been put on the LRU list. Now that we charge directly on swapin, the
    lrucare portion of the charge code is unused.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Cc: Shakeel Butt
    Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Right now, users that are otherwise memory controlled can easily escape
    their containment and allocate significant amounts of memory that they're
    not being charged for. That's because swap readahead pages are not being
    charged until somebody actually faults them into their page table. This
    can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
    allocations without charging the pages.

    There are additional problems with the delayed charging of swap pages:

    1. To implement refault/workingset detection for anonymous pages, we
    need to have a target LRU available at swapin time, but the LRU is not
    determinable until the page has been charged.

    2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
    stable when the page is isolated from the LRU; otherwise, the locks
    change under us. But swapcache gets charged after it's already on the
    LRU, and even if we cannot isolate it ourselves (since charging is not
    exactly optional).

    The previous patch ensured we always maintain cgroup ownership records for
    swap pages. This patch moves the swapcache charging point from the fault
    handler to swapin time to fix all of the above problems.
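
    A hedged sketch of the new charging point in __read_swap_cache_async()
    (the call site, label and error handling are assumptions based on the
    mem_cgroup_charge() interface of that era):

        /* charge the freshly allocated swapin page before it is added to
         * the swap cache and the LRU, instead of at fault time */
        if (mem_cgroup_charge(page, NULL, gfp_mask)) {
                put_swap_page(page, entry);
                goto fail_unlock;               /* hypothetical label */
        }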

    v2: simplify swapin error checking (Joonsoo)

    [hughd@google.com: fix livelock in __read_swap_cache_async()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Rafael Aquini
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Jun, 2020

1 commit

  • "prev_offset" is a static variable in swapin_nr_pages() that can be
    accessed concurrently with only mmap_sem held in read mode as noticed by
    KCSAN,

    BUG: KCSAN: data-race in swap_cluster_readahead / swap_cluster_readahead

    write to 0xffffffff92763830 of 8 bytes by task 14795 on cpu 17:
    swap_cluster_readahead+0x2a6/0x5e0
    swapin_readahead+0x92/0x8dc
    do_swap_page+0x49b/0xf20
    __handle_mm_fault+0xcfb/0xd70
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x715
    page_fault+0x34/0x40

    1 lock held by (dnf)/14795:
    #0: ffff897bd2e98858 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x143/0x715
    do_user_addr_fault at arch/x86/mm/fault.c:1405
    (inlined by) do_page_fault at arch/x86/mm/fault.c:1535
    irq event stamp: 83493
    count_memcg_event_mm+0x1a6/0x270
    count_memcg_event_mm+0x119/0x270
    __do_softirq+0x365/0x589
    irq_exit+0xa2/0xc0

    read to 0xffffffff92763830 of 8 bytes by task 1 on cpu 22:
    swap_cluster_readahead+0xfd/0x5e0
    swapin_readahead+0x92/0x8dc
    do_swap_page+0x49b/0xf20
    __handle_mm_fault+0xcfb/0xd70
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x715
    page_fault+0x34/0x40

    1 lock held by systemd/1:
    #0: ffff897c38f14858 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x143/0x715
    irq event stamp: 43530289
    count_memcg_event_mm+0x1a6/0x270
    count_memcg_event_mm+0x119/0x270
    __do_softirq+0x365/0x589
    irq_exit+0xa2/0xc0
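
    A hedged sketch of the sort of annotation such a report usually leads to
    (the readahead heuristic around it is abbreviated and assumed):

        static unsigned long prev_offset;

        /* lockless read of the last faulting offset; the annotation makes
         * the intentional, tearing-tolerant access explicit */
        if (offset != READ_ONCE(prev_offset) + 1 &&
            offset != READ_ONCE(prev_offset) - 1)
                pages = 1;

        /* lockless publication of the new offset for the next fault */
        WRITE_ONCE(prev_offset, offset);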

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200402213748.2237-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     

03 Apr, 2020

1 commit

  • add_to_swap_cache() and delete_from_swap_cache() are counterparts, while
    currently they use different ways to count pages.

    It doesn't break anything because we only have two sizes for PageAnon, but
    this is confusing and not good practice.

    This patch corrects it by making both functions use hpage_nr_pages().

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Link: http://lkml.kernel.org/r/20200315012920.2687-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

25 Sep, 2019

2 commits

  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    Kirill and Huang Ying contributed several fixes.

    [willy@infradead.org: use compound_nr, squish uninit-var warning]
    Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-by: Song Liu
    Tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Tested-by: Mikhail Gavrilov
    Cc: Hugh Dickins
    Cc: Chris Wilson
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Jul, 2019

2 commits

  • total_swapcache_pages() may race with swapper_spaces[] allocation and
    freeing. Previously, this was protected with a swapper_spaces[]-specific
    RCU mechanism. To simplify the logic/code complexity, it is replaced with
    get/put_swap_device(). The number of lines of code is reduced too.
    Although not so important, swapoff() performance improves as well, because
    one synchronize_rcu() call during swapoff() is deleted.

    [ying.huang@intel.com: fix bad swap file entry warning]
    Link: http://lkml.kernel.org/r/20190531024102.21723-1-ying.huang@intel.com
    Link: http://lkml.kernel.org/r/20190527082714.12151-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Tested-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Paul E. McKenney
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Tim Chen
    Cc: Mel Gorman
    Cc: Jérôme Glisse
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Yang Shi
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Jan Kara
    Cc: Dave Jiang
    Cc: Daniel Jordan
    Cc: Andrea Parri
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • When swapin is performed, after getting the swap entry information from
    the page table, the system will swap in the swap entry without any lock
    held to prevent the swap device from being swapped off. This may cause a
    race like the one below:

    CPU 1                           CPU 2
    -----                           -----
    do_swap_page
      swapin_readahead
        __read_swap_cache_async
                                    swapoff
                                      p->swap_map = NULL
          swapcache_prepare
            __swap_duplicate
              p->swap_map[?] /* !!! NULL pointer access */

    Because swapoff is usually done only at system shutdown, the race may not
    hit many people in practice. But it is still a race that needs to be
    fixed.

    To fix the race, get_swap_device() is added to check whether the specified
    swap entry is valid in its swap device. If so, it will keep the swap
    entry valid via preventing the swap device from being swapoff, until
    put_swap_device() is called.

    Because swapoff() is a very rare code path, to make the normal path run as
    fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are used
    instead of a reference count to implement get/put_swap_device(). From
    get_swap_device() to put_swap_device(), the RCU read side is locked, so
    synchronize_rcu() in swapoff() will wait until put_swap_device() is
    called.
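
    A hedged sketch of the resulting usage pattern on the swap cache lookup
    side (simplified):

        struct swap_info_struct *si;

        si = get_swap_device(entry);
        if (!si)
                return NULL;    /* device is being, or has been, swapped off */
        page = find_get_page(swap_address_space(entry), swp_offset(entry));
        put_swap_device(si);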

    In addition to the swap_map, cluster_info, etc. data structures in struct
    swap_info_struct, the swap cache radix tree will be freed after swapoff,
    so this patch fixes the race between swap cache lookup and swapoff too.

    Races between some other swap cache usages and swapoff are fixed too, via
    calling synchronize_rcu() between clearing PageSwapCache() and freeing the
    swap cache data structure.

    Another possible method to fix this is to use preempt_off() +
    stop_machine() to prevent the swap device from being swapped off while its
    data structures are being accessed. The hot-path overhead of both methods
    is similar. The advantages of the RCU-based method are:

    1. stop_machine() may disturb the normal execution code path on other
    CPUs.

    2. File cache uses RCU to protect its radix tree. If the similar
    mechanism is used for swap cache too, it is easier to share code
    between them.

    3. RCU is used to protect swap cache in total_swapcache_pages() and
    exit_swap_address_space() already. The two mechanisms can be
    merged to simplify the logic.

    Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
    Fixes: 235b62176712 ("mm/swap: add cluster lock")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Not-nacked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Paul E. McKenney
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Tim Chen
    Cc: Mel Gorman
    Cc: Jérôme Glisse
    Cc: Yang Shi
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Jan Kara
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

06 Jul, 2019

1 commit

  • This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

    Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
    __delete_from_swap_cache() to trigger:

    page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
    anon
    flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
    raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
    raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
    page dumped because: VM_BUG_ON_PAGE(entry != page)
    page->mem_cgroup:ffff978433ace000
    ------------[ cut here ]------------
    kernel BUG at mm/swap_state.c:170!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
    Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
    RIP: 0010:__delete_from_swap_cache+0x20d/0x240
    Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
    RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
    RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
    RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
    R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
    R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
    FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
    Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

    and it's not immediately obvious why it happens. It's too late in the
    rc cycle to do anything but revert for now.

    Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
    Reported-and-bisected-by: Mikhail Gavrilov
    Suggested-by: Jan Kara
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Kirill Shutemov
    Cc: William Kucharski
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 May, 2019

1 commit

  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    [willy@infradead.org: fix swapcache pages]
    Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
    [kirill@shutemov.name: hugetlb stores pages in page cache differently]
    Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
    Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-and-tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Cc: Hugh Dickins
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Mar, 2019

2 commits

  • swap_vma_readahead()'s comment is missing, just add it.

    Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Andrew Morton
    Cc: Huang Ying
    Cc: Tim Chen
    Cc: Minchan Kim
    Cc: Daniel Jordan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Swap readahead would read in a few pages regardless of whether the
    underlying device is busy or not. This may incur a long wait if the
    device is congested, and it may also exacerbate the congestion.

    Use inode_read_congested() to check whether the underlying device is busy,
    as file page readahead does, getting the inode from swap_info_struct.

    Although we could add the inode information to swap_address_space
    (address_space->host), that may have unexpected side effects, e.g. it may
    break mapping_cap_account_dirty(). Using the inode from swap_info_struct
    is simple and good enough.

    Just do the check in vma_cluster_readahead(), since swap_vma_readahead()
    is only used for non-rotational devices, which are much less likely to be
    congested than a traditional HDD.

    Although swap slots may be consecutive on a swap partition, they may still
    be fragmented on a swap file. This check helps reduce excessive stalls in
    such cases.
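
    A hedged sketch of the described check in the cluster readahead path (the
    field chain is assumed from struct swap_info_struct):

        struct inode *inode = si->swap_file->f_mapping->host;

        if (inode_read_congested(inode))
                goto skip;      /* backing device congested: skip readahead */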

    The test with page_fault1 of will-it-scale (sometimes tracing may just
    show runtest.py that is the wrapper script of page_fault1), which
    basically launches NR_CPU threads to generate 128MB anonymous pages for
    each thread, on my virtual machine with congested HDD shows long tail
    latency is reduced significantly.

    Without the patch
    page_fault1_thr-1490 [023] 129.311706: funcgraph_entry: #57377.796 us | do_swap_page();
    page_fault1_thr-1490 [023] 129.369103: funcgraph_entry: 5.642us | do_swap_page();
    page_fault1_thr-1490 [023] 129.369119: funcgraph_entry: #1289.592 us | do_swap_page();
    page_fault1_thr-1490 [023] 129.370411: funcgraph_entry: 4.957us | do_swap_page();
    page_fault1_thr-1490 [023] 129.370419: funcgraph_entry: 1.940us | do_swap_page();
    page_fault1_thr-1490 [023] 129.378847: funcgraph_entry: #1411.385 us | do_swap_page();
    page_fault1_thr-1490 [023] 129.380262: funcgraph_entry: 3.916us | do_swap_page();
    page_fault1_thr-1490 [023] 129.380275: funcgraph_entry: #4287.751 us | do_swap_page();

    With the patch
    runtest.py-1417 [020] 301.925911: funcgraph_entry: #9870.146 us | do_swap_page();
    runtest.py-1417 [020] 301.935785: funcgraph_entry: 9.802us | do_swap_page();
    runtest.py-1417 [020] 301.935799: funcgraph_entry: 3.551us | do_swap_page();
    runtest.py-1417 [020] 301.935806: funcgraph_entry: 2.142us | do_swap_page();
    runtest.py-1417 [020] 301.935853: funcgraph_entry: 6.938us | do_swap_page();
    runtest.py-1417 [020] 301.935864: funcgraph_entry: 3.765us | do_swap_page();
    runtest.py-1417 [020] 301.935871: funcgraph_entry: 3.600us | do_swap_page();
    runtest.py-1417 [020] 301.935878: funcgraph_entry: 7.202us | do_swap_page();

    [akpm@linux-foundation.org: code cleanup]
    [yang.shi@linux.alibaba.com: add comment]
    Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Tim Chen
    Reviewed-by: Andrew Morton
    Cc: Huang Ying
    Cc: Minchan Kim
    Cc: Daniel Jordan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

27 Oct, 2018

1 commit

  • Refaults happen during transitions between workingsets as well as in-place
    thrashing. Knowing the difference between the two has a range of
    applications, including measuring the impact of memory shortage on the
    system performance, as well as the ability to smarter balance pressure
    between the filesystem cache and the swap-backed workingset.

    During workingset transitions, inactive cache refaults and pushes out
    established active cache. When that active cache isn't stale, however,
    and also ends up refaulting, that's bonafide thrashing.

    Introduce a new page flag that tells on eviction whether the page has been
    active or not in its lifetime. This bit is then stored in the shadow
    entry, to classify refaults as transitioning or thrashing.

    How many page->flags does this leave us with on 32-bit?

    20 bits are always page flags

    21 if you have an MMU

    23 with the zone bits for DMA, Normal, HighMem, Movable

    29 with the sparsemem section bits

    30 if PAE is enabled

    31 with this patch.

    So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
    that's not enough, the system can switch to discontigmem and re-gain the 6
    or 7 sparsemem section bits.
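
    A rough sketch of the flow described above, with helper names assumed
    from mm/workingset.c rather than quoted from the patch:

        /* on eviction: record whether the page was ever active, as one
         * extra bit packed into the shadow entry */
        shadow = pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));

        /* on refault: that bit tells bona fide thrashing apart from a
         * workingset transition */
        if (workingset)
                SetPageWorkingset(page);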

    Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Oct, 2018

3 commits


13 Jun, 2018

1 commit

  • The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
    patch replaces cases of:

    kvzalloc(a * b, gfp)

    with:
    kvcalloc(a, b, gfp)

    as well as handling cases of:

    kvzalloc(a * b * c, gfp)

    with:

    kvzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kvcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kvzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.
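
    For illustration, a typical 2-factor conversion looks like this
    (hypothetical call site):

        /* before: the open-coded multiplication can overflow silently */
        table = kvzalloc(nr_entries * sizeof(*table), GFP_KERNEL);

        /* after: kvcalloc() checks the multiplication for overflow */
        table = kvcalloc(nr_entries, sizeof(*table), GFP_KERNEL);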

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kvzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kvzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kvzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kvzalloc
    + kvcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kvzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kvzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(sizeof(THING) * C2, ...)
    |
    kvzalloc(sizeof(TYPE) * C2, ...)
    |
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(C1 * C2, ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jun, 2018

1 commit

  • Patch series "mm, memcontrol: Implement memory.swap.events", v2.

    This patchset implements memory.swap.events which contains max and fail
    events so that userland can monitor and respond to swap running out.

    This patch (of 2):

    get_swap_page() is always followed by mem_cgroup_try_charge_swap().
    This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and
    makes get_swap_page() call the function even after swap allocation
    failure.

    This simplifies the callers and consolidates memcg related logic and
    will ease adding swap related memcg events.
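
    A hedged sketch of the consolidated flow at the end of get_swap_page()
    (the control flow and label are assumptions, not quoted from the patch):

        /* slot allocation as before; entry.val stays 0 on failure */
    out:
        /* called even when allocation failed, so memcg can account both
         * the success and the failure */
        if (mem_cgroup_try_charge_swap(page, entry)) {
                put_swap_page(page, entry);
                entry.val = 0;
        }
        return entry;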

    Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Apr, 2018

3 commits

  • The bool enable_vma_readahead and swap_vma_readahead() are local to the
    source and do not need to be in global scope, so make them static.

    Cleans up sparse warnings:

    mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
    mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • This patch makes do_swap_page() no longer need to be aware of the two
    different swap readahead algorithms; just unify the cluster-based and
    vma-based readahead function calls.
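
    The unified entry point presumably ends up looking roughly like this:

        struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                      struct vm_fault *vmf)
        {
                return swap_use_vma_readahead() ?
                        swap_vma_readahead(entry, gfp_mask, vmf) :
                        swap_cluster_readahead(entry, gfp_mask, vmf);
        }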

    Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Looking at the recent changes to swap readahead, I am very unhappy about
    the current code structure, which diverges into two swap readahead
    algorithms in do_swap_page(). This patch cleans it up.

    The main motivation is that the fault handler doesn't need to be aware of
    the readahead algorithms; it should just call swapin_readahead().

    As a first step, this patch cleans things up a little bit, though not
    perfectly (split out to make review easier); the next patch completes the
    goal.

    [minchan@kernel.org: do not check readahead flag with THP anon]
    Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

16 Nov, 2017

2 commits

  • All callers of release_pages claim the pages being released are cache
    hot. As no one cares about the hotness of pages being released to the
    allocator, just ditch the parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • These global variables are only set during initialization or rarely
    change, so declare them as __read_mostly.
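
    For example (a generic illustration with a hypothetical variable, not the
    exact ones touched by this patch):

        /* written at init or rarely, read on every swapin path */
        static bool swap_readahead_enabled __read_mostly = true;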

    Link: http://lkml.kernel.org/r/1507802349-5554-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Changbin Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changbin Du