15 Aug, 2020

2 commits

  • swap_info_struct fields si.highest_bit, si.swap_map[offset] and si.flags
    could each be accessed concurrently, as noticed by KCSAN,

    === si.highest_bit ===

    write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
    swap_range_alloc+0x81/0x130
    swap_range_alloc at mm/swapfile.c:681
    scan_swap_map_slots+0x371/0xb90
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
    scan_swap_map_slots+0x4a6/0xb90
    scan_swap_map_slots at mm/swapfile.c:892
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    === si.swap_map[offset] ===

    write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
    __swap_entry_free_locked+0x8c/0x100
    __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
    __swap_entry_free.constprop.20+0x69/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
    _swap_info_get+0x81/0xa0
    _swap_info_get at mm/swapfile.c:1140
    free_swap_and_cache+0x40/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    === si.flags ===

    write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
    _swap_info_get+0x41/0xa0
    __swap_info_get at mm/swapfile.c:1114
    put_swap_page+0x84/0x490
    __remove_mapping+0x384/0x5f0
    shrink_page_list+0xff1/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    The writes are under si->lock but the reads are not. For si.highest_bit
    and si.swap_map[offset], the data races could trigger logic bugs, so fix
    them by using WRITE_ONCE() for the writes and READ_ONCE() for the reads,
    except for those isolated reads that only compare against zero, where a
    data race causes no harm. Annotate those as intentional data races using
    the data_race() macro.

    For si.flags, the readers are only interested in a single bit, so a data
    race there causes no issue.
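
    As a rough illustration of the annotation pattern (a simplified sketch,
    not the actual patch; the surrounding logic is condensed):

    /* Writer, under si->lock; annotate because lockless readers exist. */
    spin_lock(&si->lock);
    WRITE_ONCE(si->highest_bit, new_highest);
    spin_unlock(&si->lock);

    /* Lockless read whose value feeds further logic: pair it with the write. */
    if (offset > READ_ONCE(si->highest_bit))
            return 0;

    /* Lockless read that only checks for zero: a stale value is harmless,
     * so mark the race as intentional instead of silencing it. */
    if (data_race(si->swap_map[offset]))
            return 0;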

    [cai@lca.pw: add a missing annotation for si->flags in memory.c]
    Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

2 commits

  • Workingset detection for anonymous pages will be implemented in the
    following patch, and it requires storing shadow entries in the swapcache.
    This patch implements the infrastructure to store shadow entries in the
    swapcache.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In the current implementation, newly created or swapped-in anonymous
    pages start on the active list. Growing the active list results in
    rebalancing the active/inactive lists, so old pages on the active list
    are demoted to the inactive list. Hence, pages on the active list aren't
    protected at all.

    The following is an example of this situation.

    Assume that there are 50 hot pages on the active list. Numbers denote the
    number of pages on the active/inactive lists (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. Like the file LRU, newly created or
    swapped-in anonymous pages will be inserted into the inactive list. They
    are promoted to the active list if enough references happen. This simple
    modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, the hot pages on the active list are now protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the size
    of the inactive list but less than the total size (active + inactive). To
    solve this potential issue, the following patch will apply workingset
    detection similar to the one already applied to the file LRU.
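
    A minimal sketch of the new placement policy (helper use simplified and
    the function name hypothetical, not the actual diff):

    /* Illustrative only: queueing a freshly faulted-in anonymous page. */
    static void queue_new_anon_page(struct page *page)
    {
            /* Before: SetPageActive(page) put it straight on the active list. */
            /* After: start on the inactive list, like file pages; the page is
             * activated later only if it is referenced again while inactive. */
            lru_cache_add(page);
    }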

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

01 Jul, 2020

1 commit


10 Jun, 2020

2 commits

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
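
    For reference, a typical call site looks like this before and after the
    conversion (illustrative sketch only):

    /* Before: raw rwsem operations on mm->mmap_sem. */
    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);
    up_read(&mm->mmap_sem);

    /* After: the mmap locking API hides the lock implementation. */
    mmap_read_lock(mm);
    vma = find_vma(mm, addr);
    mmap_read_unlock(mm);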

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically boil down
    to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdef magic.
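
    For example, an architecture with a non-standard layout could keep a
    pattern along these lines (sketch; the arch helper name is hypothetical):

    /* arch/xyz/include/asm/pgtable.h */
    #define pmd_offset pmd_offset
    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)xyz_pmd_table_base(pud) + pmd_index(address);
    }

    /* The generic header then provides its version only under
     * "#ifndef pmd_offset". */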

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The include/linux/mm.h header includes <asm/pgalloc.h> to allow inlining
    of the functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgalloc.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgalloc.h>") ; do
            sed -i -e '/include <asm\/pgalloc.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

5 commits

  • Right now, users that are otherwise memory controlled can easily escape
    their containment and allocate significant amounts of memory that they're
    not being charged for. That's because swap readahead pages are not being
    charged until somebody actually faults them into their page table. This
    can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
    allocations without charging the pages.

    There are additional problems with the delayed charging of swap pages:

    1. To implement refault/workingset detection for anonymous pages, we
    need to have a target LRU available at swapin time, but the LRU is not
    determinable until the page has been charged.

    2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
    stable when the page is isolated from the LRU; otherwise, the locks
    change under us. But swapcache gets charged after it's already on the
    LRU, and even if we cannot isolate it ourselves (since charging is not
    exactly optional).

    The previous patch ensured we always maintain cgroup ownership records for
    swap pages. This patch moves the swapcache charging point from the fault
    handler to swapin time to fix all of the above problems.

    v2: simplify swapin error checking (Joonsoo)

    [hughd@google.com: fix livelock in __read_swap_cache_async()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Rafael Aquini
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the page->mapping requirement gone from memcg, we can charge anon and
    file-thp pages in one single step, right after they're allocated.

    This removes two out of three API calls - especially the tricky commit
    step that needed to happen at just the right time between when the page is
    "set up" and when it's "published" - somewhat vague and fluid concepts
    that varied by page type. All we need is a freshly allocated page and a
    memcg context to charge.

    v2: prevent double charges on pre-allocated hugepages in khugepaged

    [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
    Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains a private MEMCG_RSS counter. This divergence from the
    generic VM accounting means unnecessary code overhead, and creates a
    dependency for memcg that page->mapping is set up at the time of charging,
    so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
    lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
    same way we do for NR_FILE_MAPPED.

    With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
    counter, this patch finally eliminates the need to have page->mapping set
    up at charge time. However, we need to have page->mem_cgroup set up by
    the time rmap runs and does the accounting, so switch the commit and the
    rmap callbacks around.

    v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The cgroup swaprate throttling is about matching new anon allocations to
    the rate of available IO when that is being throttled. It's the io
    controller hooking into the VM, rather than a memory controller thing.

    Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and
    drop the @memcg argument which is only used to check whether the preceding
    page charge has succeeded and the fault is proceeding.

    We could decouple the call from mem_cgroup_try_charge() here as well, but
    that would cause unnecessary churn: the following patches convert all
    callsites to a new charge API and we'll decouple as we go along.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published and somebody may split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.
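
    For reference, the inference is cheap; hpage_nr_pages() is roughly the
    following (paraphrased sketch of the huge_mm.h helper of that era):

    static inline int hpage_nr_pages(struct page *page)
    {
            if (unlikely(PageTransHuge(page)))
                    return HPAGE_PMD_NR;
            return 1;
    }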

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Jun, 2020

15 commits

  • Fix the heading and Size/Used/Priority field alignments in /proc/swaps.
    If the Size and/or Used value is >= 10000000 (8 bytes), then the
    alignment by using tab characters is broken.

    This patch maintains the use of tabs for alignment. If spaces are
    preferred, we can just use a Field Width specifier for the bytes and
    inuse fields. That way those fields don't have to be a multiple of 8
    bytes in width. E.g., with a field width of 12, both Size and Used
    would always fit on the first line of an 80-column wide terminal (only
    Priority would be on the second line).

    There are actually 2 problems: heading alignment and field width. On an
    xterm, if Used is 7 bytes in length, the tab does nothing, and the
    display is like this, with no space/tab between the Used and Priority
    fields. (ugh)

    Filename Type Size Used Priority
    /dev/sda8 partition 16779260 2023012-1

    To be clear, if one does 'cat /proc/swaps >/tmp/proc.swaps', it does look
    different, like so:

    Filename Type Size Used Priority
    /dev/sda8 partition 16779260 2086988 -1

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/c0ffb41a-81ac-ddfa-d452-a9229ecc0387@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • In some swap scalability tests, it is found that there is heavy lock
    contention on the swap cache even though we have split the per-swap-device
    swap cache radix tree into one radix tree per 64MB trunk in commit
    4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

    The reason is as follows. After the swap device becomes fragmented so
    that there are no free swap clusters, the swap device will be scanned
    linearly to find free swap slots. swap_info_struct->cluster_next is the
    next scanning base and is shared by all CPUs, so nearby free swap slots
    will be allocated for different CPUs. The probability that multiple CPUs
    operate on the same 64MB trunk is high. This causes the lock contention
    on the swap cache.

    To solve the issue, this patch adds a per-CPU next scanning base
    (cluster_next_cpu) for SSD swap devices. Every CPU will use its own next
    scanning base, and after finishing scanning a 64MB trunk, the per-CPU
    scanning base is moved to the beginning of another randomly selected 64MB
    trunk. In this way, the probability that multiple CPUs operate on the
    same 64MB trunk is greatly reduced, and thus so is the lock contention.
    For HDDs, because sequential access is more important for IO performance,
    the original shared next scanning base is kept.
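
    Conceptually, the per-CPU scanning base behaves like this (simplified
    sketch; the trunk helpers below are hypothetical stand-ins, not the
    actual patch):

    /* Start scanning from this CPU's private base. */
    unsigned long offset = this_cpu_read(*si->cluster_next_cpu);

    /* ... linear scan from "offset" for a free swap slot ... */

    /* After finishing a 64MB trunk, park this CPU's base at the start of a
     * randomly selected trunk so CPUs spread out over the device. */
    if (finished_trunk(si, offset))                     /* hypothetical */
            this_cpu_write(*si->cluster_next_cpu,
                           random_trunk_start(si));     /* hypothetical */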

    To test the patch, we have run 16-process pmbench memory benchmark on a
    2-socket server machine with 48 cores. One ram disk is configured as the
    swap device per socket. The pmbench working-set size is much larger than
    the available memory so that swapping is triggered. The memory read/write
    ratio is 80/20 and the accessing pattern is random. In the original
    implementation, the lock contention on the swap cache is heavy. The perf
    profiling data of the lock contention code path is as following,

    _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list: 7.91
    _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 7.11
    _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
    _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.29
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 0.93

    After applying this patch, it becomes,

    _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 2.3
    _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.8
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19

    The lock contention on the swap cache is almost eliminated.

    And the pmbench score increases by 18.5%. The swapin throughput increases
    by 18.7%, from 2.96 GB/s to 3.51 GB/s, while the swapout throughput
    increases by 18.5%, from 2.99 GB/s to 3.54 GB/s.

    We need a really fast disk to show the benefit. I have tried this on 2
    Intel P3600 NVMe disks; the performance improvement is only about 1%.
    The improvement should be better on faster disks, such as an Intel Optane
    disk.

    [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel]
    Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com
    [ying.huang@intel.com: v4]
    Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • This improves code readability and takes advantage of the common
    implementation.

    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200512081013.520201-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • __swap_entry_free() always frees 1 entry. Let's remove the 'usage'
    parameter.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200501015259.32237-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Currently, the scalability of the swap code drops considerably when the
    swap device becomes fragmented, because swap slot allocation batching
    stops working. To solve the problem, this patch tries to scan a few more
    swap slots, with restricted effort, to batch swap slot allocation even if
    the swap device is fragmented. Tests show that the benchmark score can
    increase by up to 37.1% with the patch. Details are as follows.

    The swap code has a per-cpu cache of swap slots. These batch swap space
    allocations to improve swap subsystem scaling. In the following code
    path,

    add_to_swap()
      get_swap_page()
        refill_swap_slots_cache()
          get_swap_pages()
            scan_swap_map_slots()

    scan_swap_map_slots() and get_swap_pages() can return multiple swap
    slots for each call. These slots will be cached in the per-CPU swap
    slots cache, so that several following swap slot requests will be
    fulfilled there to avoid the lock contention in the lower level swap
    space allocation/freeing code path.

    But this only works when there are free swap clusters. If a swap device
    becomes so fragmented that there are no free swap clusters,
    scan_swap_map_slots() and get_swap_pages() will return only one swap slot
    for each call in the above code path. Effectively, this falls back to the
    situation before the swap slots cache was introduced: the heavy lock
    contention on the swap related locks kills the scalability.

    Why does it work this way? Because the swap device could be large and
    scanning for free swap slots could be quite time consuming, the
    conservative method was chosen to avoid spending too much time scanning.

    In fact, this can be improved by scanning a few more free slots with
    strictly restricted effort, which is what this patch implements. In
    scan_swap_map_slots(), after the first free swap slot is found, we will
    try to scan a little more, but only if we haven't scanned too many slots
    already (< LATENCY_LIMIT). That is, the added scanning latency is
    strictly restricted.
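
    The bounded extra scanning can be pictured roughly as follows (sketch
    only; next_free_slot() is a stand-in helper, LATENCY_LIMIT is the
    existing limit mentioned above):

    int n_ret = 0, scanned = 0;

    while (n_ret < nr) {
            offset = next_free_slot(si, offset);    /* stand-in helper */
            if (!offset)
                    break;
            slots[n_ret++] = swp_entry(si->type, offset);

            /* Strictly bound the extra work on a fragmented device. */
            if (++scanned >= LATENCY_LIMIT)
                    break;
    }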

    To test the patch, we have run 16-process pmbench memory benchmark on a
    2-socket server machine with 48 cores. Multiple ram disks are
    configured as the swap devices. The pmbench working-set size is much
    larger than the available memory so that swapping is triggered. The
    memory read/write ratio is 80/20 and the accessing pattern is random, so
    the swap space becomes highly fragmented during the test. In the
    original implementation, the lock contention on swap related locks is
    very heavy. The perf profiling data of the lock contention code path is
    as following,

    _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap: 21.03
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.92
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.72
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 0.69

    While after applying this patch, it becomes,

    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 4.89
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 3.85
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.1
    _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88

    That is, the lock contention on the swap locks is eliminated.

    And the pmbench score increases by 37.1%. The swapin throughput increases
    by 45.7%, from 2.02 GB/s to 2.94 GB/s, while the swapout throughput
    increases by 45.3%, from 2.04 GB/s to 2.97 GB/s.

    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Tim Chen
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • There are two duplicate pieces of code handling the case when there is no
    available swap entry. To avoid this, we can compare tmp and max first and
    let the second guard do its job.

    No functional change is expected.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-3-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • If tmp is greater than or equal to max, we would jump to new_cluster.

    Return true directly.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • It is not necessary to use the variable found_free to record the status.
    Just checking tmp against max is enough.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • scan_swap_map_slots() is only called by scan_swap_map() and
    get_swap_pages(), both of which ensure that nr does not exceed SWAP_BATCH.

    Just remove the check.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200325220309.9803-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Use min3() to simplify the comparison and make it more self-explaining.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200325220309.9803-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Now we can see there are redundant gotos for the SSD case. In these two
    places, we can just let the code fall through to the correct label instead
    of explicitly jumping to it.

    Let's remove them for better readability.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-4-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The code shows that if this is an SSD, it jumps to a specific label and
    skips the following non-SSD code.

    Let's use "else if" to explicitly show the mutual exclusion of the
    SSD/non-SSD cases and reduce ambiguity.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-3-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • scan_swap_map_slots() iterates the swap_map[] array for an available swap
    entry. After several optimizations, e.g. for the SSD case, the logic of
    this function is not easy to follow.

    This patchset tries to clean up the logic a little:

    * show that the SSD and non-SSD cases are handled mutually exclusively
    * remove some unnecessary gotos for the SSD case

    This patch (of 3):

    When si->cluster_nr is zero, the function reaches 'done' and returns, and
    the incremented offset is not used any more. This means we can move the
    offset increment into the if clause.

    This brings a further code cleanup possibility.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-1-richard.weiyang@gmail.com
    Link: http://lkml.kernel.org/r/20200328060520.31449-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • In unuse_pte_range() we blindly swap in pages without checking whether
    the swap entry is already present in the swap cache.

    Because of this, the hit/miss ratio used by the swap readahead heuristic
    is not properly updated, which leads to non-optimal performance during
    swapoff.

    Tracing the distribution of the readahead size returned by the swap
    readahead heuristic during swapoff shows that a small readahead size is
    used most of the time as if we had only misses (this happens both with
    cluster and vma readahead), for example:

    r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
    COUNT EVENT
    36948 $retval = 8
    44151 $retval = 4
    49290 $retval = 1
    527771 $retval = 2

    Checking whether the swap entry is present in the swap cache instead
    allows us to properly update the readahead statistics, so the heuristic
    behaves better during swapoff, selecting a bigger readahead size:

    r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
    COUNT EVENT
    1618 $retval = 1
    4960 $retval = 2
    41315 $retval = 4
    103521 $retval = 8
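
    The gist of the change in unuse_pte_range() is roughly the following
    (simplified sketch, not the exact diff):

    /* Consult the swap cache first so the readahead hit statistics are
     * updated; only fall back to readahead on a miss. */
    page = lookup_swap_cache(entry, vma, addr);
    if (!page) {
            struct vm_fault vmf = {
                    .vma = vma,
                    .address = addr,
                    .pmd = pmd,
            };

            page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
    }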

    In terms of swapoff performance the result is the following:

    Testing environment
    ===================

    - Host:
    CPU: 1.8GHz Intel Core i7-8565U (quad-core, 8MB cache)
    HDD: PC401 NVMe SK hynix 512GB
    MEM: 16GB

    - Guest (kvm):
    8GB of RAM
    virtio block driver
    16GB swap file on ext4 (/swapfile)

    Test case
    =========
    - allocate 85% of memory
    - `systemctl hibernate` to force all the pages to be swapped-out to the
    swap file
    - resume the system
    - measure the time that swapoff takes to complete:
    # /usr/bin/time swapoff /swapfile

    Result (swapoff time)
    ======
                          5.6 vanilla   5.6 w/ this patch
                          -----------   -----------------
    cluster-readahead          22.09s              12.19s
    vma-readahead              18.20s              15.33s

    Conclusion
    ==========

    The specific use case this patch is addressing is to improve swapoff
    performance in cloud environments when a VM has been hibernated, resumed,
    and all the memory needs to be forced back to RAM by disabling swap.

    This change better exploits the advantages of the readahead heuristic
    during swapoff, and this improvement speeds up the resume process of such
    VMs.

    [andrea.righi@canonical.com: update changelog]
    Link: http://lkml.kernel.org/r/20200418084705.GA147642@xps-13
    Signed-off-by: Andrea Righi
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Minchan Kim
    Cc: Anchal Agarwal
    Cc: Hugh Dickins
    Cc: Vineeth Remanan Pillai
    Cc: Kelley Nielsen
    Link: http://lkml.kernel.org/r/20200416180132.GB3352@xps-13
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • Use list_{prev,next}_entry() instead of list_entry() for better
    code readability.

    Signed-off-by: chenqiwu
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: David Hildenbrand
    Cc: Wei Yang
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Yang Shi
    Cc: Qian Cai
    Cc: Baoquan He
    Link: http://lkml.kernel.org/r/1586599916-15456-2-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Linus Torvalds

    chenqiwu
     

08 Apr, 2020

1 commit

  • Now that "struct proc_ops" exist we can start putting there stuff which
    could not fly with VFS "struct file_operations"...

    Most of the fs/proc/inode.c file is dedicated to making open/read/.../close
    reliable in the event of disappearing /proc entries, which usually happens
    when a module is removed. Files like /proc/cpuinfo, which never disappear,
    simply do not need such protection.

    Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
    "permanent" files.

    Enable "permanent" flag for

    /proc/cpuinfo
    /proc/kmsg
    /proc/modules
    /proc/slabinfo
    /proc/stat
    /proc/sysvipc/*
    /proc/swaps

    More will come once I figure out a foolproof way to prevent module
    authors from marking their stuff "permanent" for performance reasons
    when it is not.

    This should help with scalability: benchmark is "read /proc/cpuinfo R times
    by N threads scattered over the system".

    N      R       t, s (before)   t, s (after)
    -----------------------------------------------------
    64     4096     1.582458        1.530502    -3.2%
    256    4096     6.371926        6.125168    -3.9%
    1024   4096    25.64888        24.47528     -4.6%

    Benchmark source:

    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    #include <fcntl.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <unistd.h>

    const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
    int N;
    const char *filename;
    int R;

    int xxx = 0;

    int glue(int n)
    {
            cpu_set_t m;
            CPU_ZERO(&m);
            CPU_SET(n, &m);
            return sched_setaffinity(0, sizeof(cpu_set_t), &m);
    }

    void f(int n)
    {
            glue(n % NR_CPUS);

            while (*(volatile int *)&xxx == 0) {
            }

            for (int i = 0; i < R; i++) {
                    int fd = open(filename, O_RDONLY);
                    char buf[4096];
                    ssize_t rv = read(fd, buf, sizeof(buf));
                    asm volatile ("" :: "g" (rv));
                    close(fd);
            }
    }

    int main(int argc, char *argv[])
    {
            if (argc < 4) {
                    std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
                    return 1;
            }

            N = atoi(argv[1]);
            filename = argv[2];
            R = atoi(argv[3]);

            for (int i = 0; i < NR_CPUS; i++) {
                    if (glue(i) == 0)
                            break;
            }

            std::vector<std::thread> T;
            T.reserve(N);
            for (int i = 0; i < N; i++) {
                    T.emplace_back(f, i);
            }

            auto t0 = std::chrono::system_clock::now();
            {
                    *(volatile int *)&xxx = 1;
                    for (auto& t: T) {
                            t.join();
                    }
            }
            auto t1 = std::chrono::system_clock::now();
            std::chrono::duration<double> dt = t1 - t0;
            std::cout << dt.count() << '\n';

            return 0;
    }

    P.S.:
    An explicit randomization marker is added because adding a non-function
    pointer will silently disable structure layout randomization.

    [akpm@linux-foundation.org: coding style fixes]
    Reported-by: kbuild test robot
    Reported-by: Dan Carpenter
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Cc: Al Viro
    Cc: Joe Perches
    Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

03 Apr, 2020

2 commits

  • si->inuse_pages could be accessed concurrently as noticed by KCSAN,

    write to 0xffff98b00ebd04dc of 4 bytes by task 82262 on cpu 92:
    swap_range_free+0xbe/0x230
    swap_range_free at mm/swapfile.c:719
    swapcache_free_entries+0x1be/0x250
    free_swap_slot+0x1c8/0x220
    __swap_entry_free.constprop.19+0xa3/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7e0/0x1ce0
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0xe7/0x240
    do_exit+0x598/0xfd0
    do_group_exit+0x8b/0x180
    get_signal+0x293/0x13d0
    do_signal+0x37/0x5d0
    prepare_exit_to_usermode+0x1b7/0x2c0
    ret_from_intr+0x32/0x42

    read to 0xffff98b00ebd04dc of 4 bytes by task 82499 on cpu 46:
    try_to_unuse+0x86b/0xc80
    try_to_unuse at mm/swapfile.c:2185
    __x64_sys_swapoff+0x372/0xd40
    do_syscall_64+0x91/0xb05
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The plain reads in try_to_unuse() are outside the si->lock critical
    section, which results in data races that could be dangerous when the
    value is used in a loop. Fix them by adding READ_ONCE().

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1582578903-29294-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The -EEXIST returned by __swap_duplicate() means there is a swap cache
    entry, not -EBUSY.

    Signed-off-by: Chen Wandun
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200212145754.27123-1-chenwandun@huawei.com
    Signed-off-by: Linus Torvalds

    Chen Wandun
     

30 Mar, 2020

1 commit

  • claim_swapfile() currently keeps the inode locked when it succeeds, or
    when the file is already a swapfile (returning -EBUSY). In the other
    error cases, it does not lock the inode.

    This inconsistency between the lock state and the return value is quite
    confusing and actually causes a bad unlock balance, as shown below, in the
    "bad_swap" section of __do_sys_swapon().

    This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE
    check out of claim_swapfile(). The inode is unlocked in the
    "bad_swap_unlock_inode" section, so that the inode is guaranteed to be
    unlocked at "bad_swap". Thus, error handling code after the locking now
    jumps to "bad_swap_unlock_inode" instead of "bad_swap".

    =====================================
    WARNING: bad unlock balance detected!
    5.5.0-rc7+ #176 Not tainted
    -------------------------------------
    swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
    but there are no more locks to release!

    other info that might help us debug this:
    no locks held by swapon/4294.

    stack backtrace:
    CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
    Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
    Call Trace:
    dump_stack+0xa1/0xea
    print_unlock_imbalance_bug.cold+0x114/0x123
    lock_release+0x562/0xed0
    up_write+0x2d/0x490
    __do_sys_swapon+0x94b/0x3550
    __x64_sys_swapon+0x54/0x80
    do_syscall_64+0xa4/0x4b0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f15da0a0dc7

    Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices")
    Signed-off-by: Naohiro Aota
    Signed-off-by: Andrew Morton
    Tested-by: Qais Yousef
    Reviewed-by: Andrew Morton
    Reviewed-by: Darrick J. Wong
    Cc: Christoph Hellwig
    Cc:
    Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
    Signed-off-by: Linus Torvalds

    Naohiro Aota
     

22 Feb, 2020

1 commit


04 Feb, 2020

1 commit

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

01 Feb, 2020

1 commit

  • If a seq_file .next function does not change the position index, a read
    after some lseek can generate unexpected output.

    In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c:
    simplify seq_file iteration code and interface") "Some ->next functions
    do not increment *pos when they return NULL... Note that such ->next
    functions are buggy and should be fixed. A simple demonstration is

    dd if=/proc/swaps bs=1000 skip=1

    Choose any block size larger than the size of /proc/swaps. This will
    always show the whole last line of /proc/swaps"

    The described problem is still present. If you lseek into the middle of
    the last output line, the following read will output the end of the last
    line and then the whole last line once again.

    $ dd if=/proc/swaps bs=1 # usual output
    Filename Type Size Used Priority
    /dev/dm-0 partition 4194812 97536 -2
    104+0 records in
    104+0 records out
    104 bytes copied

    $ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice
    dd: /proc/swaps: cannot skip to specified offset
    v/dm-0 partition 4194812 97536 -2
    /dev/dm-0 partition 4194812 97536 -2
    3+1 records in
    3+1 records out
    131 bytes copied
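
    For reference, a correct ->next must advance *pos even when it has no
    more records to return; a minimal sketch of the pattern (illustrative,
    with a hypothetical lookup helper, not the swap_show fix itself):

    static void *example_seq_next(struct seq_file *m, void *v, loff_t *pos)
    {
            ++*pos;                                 /* always move the index */
            return example_get_record(m, *pos);     /* NULL at end of file */
    }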

    https://bugzilla.kernel.org/show_bug.cgi?id=206283

    Link: http://lkml.kernel.org/r/bd8cfd7b-ac95-9b91-f9e7-e8438bd5047d@virtuozzo.com
    Signed-off-by: Vasily Averin
    Reviewed-by: Andrew Morton
    Cc: Jann Horn
    Cc: Alexander Viro
    Cc: Kees Cook
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

01 Dec, 2019

1 commit

  • A zoned block device consists of a number of zones. Zones are either
    conventional, accepting random writes, or sequential, requiring that
    writes be issued in LBA order from each zone's write pointer position.
    Because of this write restriction, zoned block devices are not suitable
    as swap devices. Disallow swapon on them.

    [akpm@linux-foundation.org: reflow and reword comment, per Christoph]
    Link: http://lkml.kernel.org/r/20191015085814.637837-1-naohiro.aota@wdc.com
    Signed-off-by: Naohiro Aota
    Reviewed-by: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: "Theodore Y. Ts'o"
    Cc: Hannes Reinecke
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naohiro Aota
     

20 Aug, 2019

2 commits


13 Jul, 2019

2 commits

  • A swap_extent is used to map a swap page offset to the backing device's
    block offset. For a contiguous block range, one swap_extent is used, and
    all these swap_extents are managed in a linked list.

    These swap_extents are used by map_swap_entry() during swap's read and
    write path. To find out the backing device's block offset for a page
    offset, the swap_extent list will be traversed linearly, with
    curr_swap_extent being used as a cache to speed up the search.

    This works well as long as there are not too many swap_extents or the
    number of processes that access the swap device is small, but when the
    swap device has many extents and a number of processes access the swap
    device concurrently, it can be a problem. On one of our servers, the
    disk's remaining free space is tight:

    $df -h
    Filesystem Size Used Avail Use% Mounted on
    ... ...
    /dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4

    When creating an 80G swapfile there, there are as many as 84656 swap
    extents. The end result is that the kernel spends about 30% of its time
    in map_swap_entry() and swap throughput is only 70MB/s.

    As a comparison, when I used a smaller swapfile, e.g. a 4G one whose
    swap_extent count dropped to 2000, swap throughput is back to 400-500MB/s
    and map_swap_entry() is about 3%.

    One downside of using an rbtree for swap_extent is that 'struct rb_node'
    takes 24 bytes while 'struct list_head' takes 16 bytes, i.e. 8 bytes more
    for each swap_extent. For a swapfile that has 80k swap_extents, that
    means 625KiB more memory consumed.

    Test:

    Since it's not possible to reboot that server, I could not test this
    patch directly there. Instead, I tested it on another server with an
    NVMe disk.

    I created a 20G swapfile on an NVMe backed XFS fs. By default, the
    filesystem is quite clean and the created swapfile has only 2 extents.
    Testing vanilla and this patch shows no obvious performance difference
    when swapfile is not fragmented.

    To see the patch's effects, I used some tweaks to manually fragment the
    swapfile by breaking the extent at 1M boundary. This made the swapfile
    have 20K extents.

    nr_task=4
    kernel    swapout(KB/s)   map_swap_entry(perf)   swapin(KB/s)   map_swap_entry(perf)
    vanilla   165191          90.77%                  171798        90.21%
    patched   858993 +420%     2.16%                  715827 +317%   0.77%

    nr_task=8
    kernel    swapout(KB/s)   map_swap_entry(perf)   swapin(KB/s)   map_swap_entry(perf)
    vanilla   306783          92.19%                  318145        87.76%
    patched   954437 +211%     2.35%                 1073741 +237%   1.57%

    swapout: the throughput of swap out, in KB/s, higher is better
    1st map_swap_entry: cpu cycles percent sampled by perf
    swapin: the throughput of swap in, in KB/s, higher is better
    2nd map_swap_entry: cpu cycles percent sampled by perf

    nr_task=1 doesn't show any difference; this is because curr_swap_extent
    can be used effectively to cache the correct swap extent for a single-task
    workload.

    [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
    Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
    Signed-off-by: Aaron Lu
    Cc: Huang Ying
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • When swapin is performed, after getting the swap entry information from
    the page table, the system will swap in the swap entry without any lock
    held to prevent the swap device from being swapped off. This may cause a
    race like the one below:

    CPU 1                              CPU 2
    -----                              -----
                                       do_swap_page
                                         swapin_readahead
                                           __read_swap_cache_async
    swapoff                                  swapcache_prepare
      p->swap_map = NULL                       __swap_duplicate
                                                 p->swap_map[?] /* !!! NULL pointer access */

    Because swapoff is usually only done at system shutdown, the race may not
    hit many people in practice. But it is still a race that needs to be
    fixed.

    To fix the race, get_swap_device() is added to check whether the specified
    swap entry is valid in its swap device. If so, it will keep the swap
    entry valid via preventing the swap device from being swapoff, until
    put_swap_device() is called.

    Because swapoff() is a very rare code path, to make the normal path run
    as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are
    used instead of a reference count to implement get/put_swap_device().
    From get_swap_device() to put_swap_device(), the RCU read side is locked,
    so synchronize_rcu() in swapoff() will wait until put_swap_device() is
    called.
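
    The reader-side usage pattern is roughly (sketch):

    struct swap_info_struct *si;

    si = get_swap_device(entry);    /* enters an RCU read-side section */
    if (!si)
            return;                 /* device is being (or has been) swapped off */

    /* ... safely access si->swap_map, the swap cache, etc. ... */

    put_swap_device(si);            /* leaves the RCU read-side section */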

    In addition to swap_map, cluster_info, etc. data structure in the struct
    swap_info_struct, the swap cache radix tree will be freed after swapoff,
    so this patch fixes the race between swap cache looking up and swapoff
    too.

    Races between some other swap cache usages and swapoff are fixed too by
    calling synchronize_rcu() between clearing PageSwapCache() and freeing the
    swap cache data structure.

    Another possible method to fix this is to use preempt_off() +
    stop_machine() to prevent the swap device from being swapoff when its data
    structure is being accessed. The overhead in hot-path of both methods is
    similar. The advantages of RCU based method are,

    1. stop_machine() may disturb the normal execution code path on other
    CPUs.

    2. File cache uses RCU to protect its radix tree. If the similar
    mechanism is used for swap cache too, it is easier to share code
    between them.

    3. RCU is used to protect swap cache in total_swapcache_pages() and
    exit_swap_address_space() already. The two mechanisms can be
    merged to simplify the logic.

    Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
    Fixes: 235b62176712 ("mm/swap: add cluster lock")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Not-nacked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Paul E. McKenney
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Tim Chen
    Cc: Mel Gorman
    Cc: Jérôme Glisse
    Cc: Yang Shi
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Jan Kara
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner