07 Dec, 2020

1 commit

  • We can't call kvfree() with a spin lock held, so defer it. Fixes a
    might_sleep() runtime warning.
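
    As an illustration only (not the patch itself; names are simplified), the
    usual pattern is to detach the pointer under the lock and only call
    kvfree() after dropping it:

    void *map;

    spin_lock(&si->lock);
    map = si->some_map;          /* detach under the lock */
    si->some_map = NULL;
    spin_unlock(&si->lock);

    kvfree(map);                 /* may sleep, so it must run unlocked */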

    Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc:
    Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
    Signed-off-by: Linus Torvalds

    Qian Cai
     

14 Oct, 2020

5 commits

  • If we failed to drain the inode, we would forget to free the swap address
    space allocated by init_swap_address_space() above.
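
    Purely as an illustrative sketch of the pattern (simplified, not the
    actual diff; the failure condition below is a placeholder): every init
    step that already succeeded needs a matching teardown on later failure
    branches:

    error = init_swap_address_space(p->type, maxpages);
    if (error)
        goto bad_swap_unlock_inode;

    /* ... */

    if (later_step_failed) {                /* placeholder condition */
        exit_swap_address_space(p->type);   /* the teardown this fix adds */
        goto bad_swap_unlock_inode;
    }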

    Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Darrick J. Wong
    Link: https://lkml.kernel.org/r/20200930101803.53884-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • It's unnecessary to goto the out label when the out label is just below.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200930102549.1885-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • We don't initially add anon pages to active lruvec after commit
    b518154e59aa ("mm/vmscan: protect the workingset on anonymous LRU").
    Remove activate_page() from unuse_pte(), which seems to have been missed
    by that commit. And make the function static while we are at it.

    Before the commit, we called lru_cache_add_active_or_unevictable() to add
    new ksm pages to active lruvec. Therefore, activate_page() wasn't
    necessary for them in the first place.

    Signed-off-by: Yu Zhao
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Cc: Alexander Duyck
    Cc: Huang Ying
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Mel Gorman
    Cc: Nicholas Piggin
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • SWP_FS is used to make swap_{read,write}page() go through the filesystem,
    and it's only used for swap files over NFS for now. Otherwise IO is
    submitted directly to the block device according to the swapfile extents
    reported by filesystems in advance.

    As Matthew pointed out [1], the SWP_FS naming is somewhat confusing, so
    let's rename it to SWP_FS_OPS.

    [1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org
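
    For context, a hedged sketch (not taken from the patch; the helper names
    below are hypothetical) of how the flag selects the IO path in
    swap_readpage()/swap_writepage():

    struct swap_info_struct *sis = page_swap_info(page);

    if (sis->flags & SWP_FS_OPS) {
        /* swap file over NFS: hand the IO to the filesystem */
        return swap_readpage_fs(page);      /* hypothetical helper */
    }
    /* otherwise build a bio against the extents and submit to the blockdev */
    return swap_readpage_bdev(page);        /* hypothetical helper */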

    Suggested-by: Matthew Wilcox
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "There's quite a lot of code here, but much of it is due to the
    addition of a new PMU driver as well as some arm64-specific selftests
    which is an area where we've traditionally been lagging a bit.

    In terms of exciting features, this includes support for the Memory
    Tagging Extension which narrowly missed 5.9, hopefully allowing
    userspace to run with use-after-free detection in production on CPUs
    that support it. Work is ongoing to integrate the feature with KASAN
    for 5.11.

    Another change that I'm excited about (assuming they get the hardware
    right) is preparing the ASID allocator for sharing the CPU page-table
    with the SMMU. Those changes will also come in via Joerg with the
    IOMMU pull.

    We do stray outside of our usual directories in a few places, mostly
    due to core changes required by MTE. Although much of this has been
    Acked, there were a couple of places where we unfortunately didn't get
    any review feedback.

    Other than that, we ran into a handful of minor conflicts in -next,
    but nothing that should pose any issues.

    Summary:

    - Userspace support for the Memory Tagging Extension introduced by
    Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

    - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
    switching.

    - Fix and subsequent rewrite of our Spectre mitigations, including
    the addition of support for PR_SPEC_DISABLE_NOEXEC.

    - Support for the Armv8.3 Pointer Authentication enhancements.

    - Support for ASID pinning, which is required when sharing
    page-tables with the SMMU.

    - MM updates, including treating flush_tlb_fix_spurious_fault() as a
    no-op.

    - Perf/PMU driver updates, including addition of the ARM CMN PMU
    driver and also support to handle CPU PMU IRQs as NMIs.

    - Allow prefetchable PCI BARs to be exposed to userspace using normal
    non-cacheable mappings.

    - Implementation of ARCH_STACKWALK for unwinding.

    - Improve reporting of unexpected kernel traps due to BPF JIT
    failure.

    - Improve robustness of user-visible HWCAP strings and their
    corresponding numerical constants.

    - Removal of TEXT_OFFSET.

    - Removal of some unused functions, parameters and prototypes.

    - Removal of MPIDR-based topology detection in favour of firmware
    description.

    - Cleanups to handling of SVE and FPSIMD register state in
    preparation for potential future optimisation of handling across
    syscalls.

    - Cleanups to the SDEI driver in preparation for support in KVM.

    - Miscellaneous cleanups and refactoring work"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    Revert "arm64: initialize per-cpu offsets earlier"
    arm64: random: Remove no longer needed prototypes
    arm64: initialize per-cpu offsets earlier
    kselftest/arm64: Check mte tagged user address in kernel
    kselftest/arm64: Verify KSM page merge for MTE pages
    kselftest/arm64: Verify all different mmap MTE options
    kselftest/arm64: Check forked child mte memory accessibility
    kselftest/arm64: Verify mte tag inclusion via prctl
    kselftest/arm64: Add utilities and a test to validate mte memory
    perf: arm-cmn: Fix conversion specifiers for node type
    perf: arm-cmn: Fix unsigned comparison to less than zero
    arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
    arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
    arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
    arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
    KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
    arm64: Get rid of arm64_ssbd_state
    KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
    KVM: arm64: Get rid of kvm_arm_have_ssbd()
    KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
    ...

    Linus Torvalds
     

27 Sep, 2020

1 commit

  • SWP_FS is used to make swap_{read,write}page() go through the
    filesystem, and it's only used for swap files over NFS. So !SWP_FS
    means non-NFS for now; it could be either file-backed or device-backed.
    Something similar goes for the legacy SWP_FILE.

    So in order to achieve the goal of the original patch, SWP_BLKDEV should
    be used instead.
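
    Roughly along these lines (a hedged sketch; the exact context may differ
    from the literal hunk):

    /* before: matches any non-NFS swap, including file-backed swapfiles */
    if (!(si->flags & SWP_FS))
        n_ret = swap_alloc_cluster(si, swp_entries);

    /* after: only block-device backed swap takes the huge-cluster path */
    if (si->flags & SWP_BLKDEV)
        n_ret = swap_alloc_cluster(si, swp_entries);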

    FS corruption can be observed with SSD device + XFS + fragmented
    swapfile due to CONFIG_THP_SWAP=y.

    I reproduced the issue with the following details:

    Environment:

    QEMU + upstream kernel + buildroot + NVMe (2 GB)

    Kernel config:

    CONFIG_BLK_DEV_NVME=y
    CONFIG_THP_SWAP=y

    Some reproducible steps:

    mkfs.xfs -f /dev/nvme0n1
    mkdir /tmp/mnt
    mount /dev/nvme0n1 /tmp/mnt
    bs="32k"
    sz="1024m" # doesn't matter too much, I also tried 16m
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

    mkswap /tmp/mnt/sw
    swapon /tmp/mnt/sw

    stress --vm 2 --vm-bytes 600M # doesn't matter too much as well

    Symptoms:
    - FS corruption (e.g. checksum failure)
    - memory corruption at: 0xd2808010
    - segfault

    Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
    Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Reviewed-by: Yang Shi
    Acked-by: Rafael Aquini
    Cc: Matthew Wilcox
    Cc: Carlos Maiolino
    Cc: Eric Sandeen
    Cc: Dave Chinner
    Cc:
    Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     

25 Sep, 2020

2 commits

  • The BDI_CAP_STABLE_WRITES flag is one of the few bits of information in the
    backing_dev_info shared between the block drivers and the writeback code.
    To help untangle the dependency, replace it with a queue flag and a
    superblock flag derived from it. This also helps with the case of e.g.
    a file system requiring stable writes due to its own checksumming, but
    not forcing it on other users of the block device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute which
    is also writable for easier testing.
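
    As a rough sketch of the resulting interface (hedged; the flag and helper
    names are my reading of this change, so verify against the tree):

    /* driver side: request stable pages for this queue */
    blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);

    /* consumer side: decide whether writeback must wait for stable pages */
    if (blk_queue_stable_writes(q))
        wait_for_stable_page(page);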

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
    decide if ->rw_page can be used on a block device. Just check for
    the method instead. The only complication is that zram needs a second
    set of block_device_operations, as it can switch between modes that
    actually support ->rw_page and those that don't.
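
    Roughly (an illustrative sketch, not the exact diff):

    /* old: consult the backing_dev_info capability bit */
    if (bdi_cap_synchronous_io(inode_to_bdi(swap_file->f_mapping->host)))
        p->flags |= SWP_SYNCHRONOUS_IO;

    /* new: just look for the method on the underlying disk */
    if (p->bdev && p->bdev->bd_disk->fops->rw_page)
        p->flags |= SWP_SYNCHRONOUS_IO;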

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2020

2 commits

  • swap_type_of is used for two entirely different purposes:

    (1) check what swap type a given device/offset corresponds to
    (2) find the first available swap device that can be written to

    Mixing both in a single function creates an unreadable mess. Create two
    separate functions instead, and switch both to pass a dev_t instead of
    a struct block_device to further simplify the code.
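
    A hedged sketch of the resulting split (signatures approximate, from
    memory of this series):

    /* (1) map a specific device/offset to a swap type, or an error */
    int swap_type_of(dev_t device, sector_t offset);

    /* (2) pick the first writable swap device and report its dev_t */
    int find_first_swap(dev_t *device);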

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Use blkdev_get_by_dev instead of bdgrab + blkdev_get.
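
    Roughly (a sketch, not the literal hunk; the holder argument is
    illustrative):

    /* before: two steps on an already looked-up block device */
    bdev = bdgrab(I_BDEV(inode));
    error = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, si);

    /* after: open straight from the dev_t in one call */
    bdev = blkdev_get_by_dev(inode->i_rdev,
                             FMODE_READ | FMODE_WRITE | FMODE_EXCL, si);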

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Sep, 2020

1 commit

  • Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
    every physical page. When swapping pages out to disk it is necessary to
    save these tags, and later restore them when reading the pages back.

    Add some hooks along with dummy implementations to enable the
    arch code to handle this.

    Three new hooks are added to the swap code:
    * arch_prepare_to_swap()
    * arch_swap_invalidate_page() / arch_swap_invalidate_area()
    One new hook is added to shmem:
    * arch_swap_restore()
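
    For reference, the generic dummy implementations are essentially no-ops
    (a paraphrased sketch; see the patch for the exact stubs):

    static inline int arch_prepare_to_swap(struct page *page)
    {
        return 0;               /* nothing to save by default */
    }

    static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
    {
    }

    static inline void arch_swap_invalidate_area(int type)
    {
    }

    static inline void arch_swap_restore(swp_entry_t entry, struct page *page)
    {
    }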

    Signed-off-by: Steven Price
    [catalin.marinas@arm.com: add unlock_page() on the error path]
    [catalin.marinas@arm.com: dropped the _tags suffix]
    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Steven Price
     

15 Aug, 2020

2 commits

  • As noticed by KCSAN, the swap_info_struct fields si.highest_bit,
    si.swap_map[offset] and si.flags could each be accessed concurrently:

    === si.highest_bit ===

    write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
    swap_range_alloc+0x81/0x130
    swap_range_alloc at mm/swapfile.c:681
    scan_swap_map_slots+0x371/0xb90
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
    scan_swap_map_slots+0x4a6/0xb90
    scan_swap_map_slots at mm/swapfile.c:892
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    === si.swap_map[offset] ===

    write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
    __swap_entry_free_locked+0x8c/0x100
    __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
    __swap_entry_free.constprop.20+0x69/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
    _swap_info_get+0x81/0xa0
    _swap_info_get at mm/swapfile.c:1140
    free_swap_and_cache+0x40/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    === si.flags ===

    write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
    _swap_info_get+0x41/0xa0
    __swap_info_get at mm/swapfile.c:1114
    put_swap_page+0x84/0x490
    __remove_mapping+0x384/0x5f0
    shrink_page_list+0xff1/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    The writes are under si->lock but the reads are not. For si.highest_bit
    and si.swap_map[offset], the data races could trigger logic bugs, so fix
    them by using WRITE_ONCE() for the writes and READ_ONCE() for the reads,
    except for those isolated reads that only compare against zero, where a
    data race would cause no harm. Annotate those as intentional data races
    using the data_race() macro.

    For si.flags, the readers are only interested in a single bit, so a data
    race there causes no issue.
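
    Schematically (a simplified sketch, not the actual hunks; the label is
    hypothetical):

    /* writer, under si->lock */
    WRITE_ONCE(si->highest_bit, new_highest);

    /* lockless reader whose result feeds real logic */
    if (offset > READ_ONCE(si->highest_bit))
        goto no_page;                       /* hypothetical label */

    /* lockless reader that only compares against zero: harmless race */
    if (data_race(si->swap_map[offset]))
        continue;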

    [cai@lca.pw: add a missing annotation for si->flags in memory.c]
    Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

2 commits

  • Workingset detection for anonymous pages will be implemented in the
    following patch, and it requires storing shadow entries in the
    swapcache. This patch implements the infrastructure to store the shadow
    entry in the swapcache.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In the current implementation, newly created or swapped-in anonymous pages
    start on the active list. Growing the active list results in rebalancing
    the active/inactive lists, so old pages on the active list are demoted to
    the inactive list. Hence, a page on the active list isn't protected at all.

    Following is an example of this situation.

    Assume there are 50 hot pages on the active list. Numbers denote the
    number of pages on the active/inactive list (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. As with the file LRU, newly created or
    swapped-in anonymous pages will be inserted onto the inactive list. They
    are promoted to the active list if enough references happen. This simple
    modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on active list would be protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the size
    of the inactive list but less than the size of the total (active +
    inactive) list. To solve this potential issue, the following patch will
    apply workingset detection similar to the one that's already applied to
    the file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Jun, 2020

2 commits

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
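
    In effect, each call site changes like this (an illustrative example, not
    a specific hunk from the series):

    /* before */
    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);
    up_read(&mm->mmap_sem);

    /* after */
    mmap_read_lock(mm);
    vma = find_vma(mm, addr);
    mmap_read_unlock(mm);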

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
    return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
    return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <linux/mm.h>") ; do
    sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

5 commits

  • Right now, users that are otherwise memory controlled can easily escape
    their containment and allocate significant amounts of memory that they're
    not being charged for. That's because swap readahead pages are not being
    charged until somebody actually faults them into their page table. This
    can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
    allocations without charging the pages.

    There are additional problems with the delayed charging of swap pages:

    1. To implement refault/workingset detection for anonymous pages, we
    need to have a target LRU available at swapin time, but the LRU is not
    determinable until the page has been charged.

    2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
    stable when the page is isolated from the LRU; otherwise, the locks
    change under us. But swapcache gets charged after it's already on the
    LRU, and even if we cannot isolate it ourselves (since charging is not
    exactly optional).

    The previous patch ensured we always maintain cgroup ownership records for
    swap pages. This patch moves the swapcache charging point from the fault
    handler to swapin time to fix all of the above problems.

    v2: simplify swapin error checking (Joonsoo)

    [hughd@google.com: fix livelock in __read_swap_cache_async()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Rafael Aquini
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the page->mapping requirement gone from memcg, we can charge anon and
    file-thp pages in one single step, right after they're allocated.

    This removes two out of three API calls - especially the tricky commit
    step that needed to happen at just the right time between when the page is
    "set up" and when it's "published" - somewhat vague and fluid concepts
    that varied by page type. All we need is a freshly allocated page and a
    memcg context to charge.

    v2: prevent double charges on pre-allocated hugepages in khugepaged

    [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
    Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains a private MEMCG_RSS counter. This divergence from the
    generic VM accounting means unnecessary code overhead, and creates a
    dependency for memcg that page->mapping is set up at the time of charging,
    so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
    lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
    same way we do for NR_FILE_MAPPED.

    With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
    counter, this patch finally eliminates the need to have page->mapping set
    up at charge time. However, we need to have page->mem_cgroup set up by
    the time rmap runs and does the accounting, so switch the commit and the
    rmap callbacks around.

    v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The cgroup swaprate throttling is about matching new anon allocations to
    the rate of available IO when that is being throttled. It's the io
    controller hooking into the VM, rather than a memory controller thing.

    Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and
    drop the @memcg argument which is only used to check whether the preceding
    page charge has succeeded and the fault is proceeding.

    We could decouple the call from mem_cgroup_try_charge() here as well, but
    that would cause unnecessary churn: the following patches convert all
    callsites to a new charge API and we'll decouple as we go along.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published and somebody may split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Jun, 2020

14 commits

  • Fix the heading and Size/Used/Priority field alignments in /proc/swaps.
    If the Size and/or Used value is >= 10000000 (8 bytes), then the
    alignment by using tab characters is broken.

    This patch maintains the use of tabs for alignment. If spaces are
    preferred, we can just use a Field Width specifier for the bytes and
    inuse fields. That way those fields don't have to be a multiple of 8
    bytes in width. E.g., with a field width of 12, both Size and Used
    would always fit on the first line of an 80-column wide terminal (only
    Priority would be on the second line).

    There are actually 2 problems: heading alignment and field width. On an
    xterm, if Used is 7 bytes in length, the tab does nothing, and the
    display is like this, with no space/tab between the Used and Priority
    fields. (ugh)

    Filename Type Size Used Priority
    /dev/sda8 partition 16779260 2023012-1

    To be clear, if one does 'cat /proc/swaps >/tmp/proc.swaps', it does look
    different, like so:

    Filename Type Size Used Priority
    /dev/sda8 partition 16779260 2086988 -1

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/c0ffb41a-81ac-ddfa-d452-a9229ecc0387@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • In some swap scalability tests, it was found that there is heavy lock
    contention on the swap cache even though we have split the single swap
    cache radix tree per swap device into one swap cache radix tree per 64 MB
    trunk in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

    The reason is as follows. After the swap device becomes fragmented so
    that there's no free swap cluster, the swap device will be scanned
    linearly to find free swap slots. swap_info_struct->cluster_next is
    the next scanning base, which is shared by all CPUs. So nearby free swap
    slots will be allocated to different CPUs. The probability of
    multiple CPUs operating on the same 64 MB trunk is high. This causes
    the lock contention on the swap cache.

    To solve the issue, this patch adds a per-CPU next scanning base
    (cluster_next_cpu) for SSD swap devices. Every CPU will use its
    own per-CPU next scanning base. And after finishing scanning a 64 MB
    trunk, the per-CPU scanning base will be changed to the beginning of
    another randomly selected 64 MB trunk. In this way, the probability of
    multiple CPUs operating on the same 64 MB trunk is greatly reduced, and
    thus the lock contention is reduced too. For HDD, because sequential
    access is more important for IO performance, the original shared next
    scanning base is kept.
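
    A simplified sketch of the idea (hedged; random_trunk_start() is a
    placeholder for the randomized trunk selection):

    /* SSD: every CPU scans from its own base; HDD keeps the shared one */
    if (si->flags & SWP_SOLIDSTATE)
        offset = this_cpu_read(*si->cluster_next_cpu);
    else
        offset = si->cluster_next;

    /* ... scan from offset; once the 64 MB trunk is exhausted, move only
       this CPU's base to the start of a randomly chosen trunk ... */
    this_cpu_write(*si->cluster_next_cpu, random_trunk_start(si));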

    To test the patch, we have run 16-process pmbench memory benchmark on a
    2-socket server machine with 48 cores. One ram disk is configured as the
    swap device per socket. The pmbench working-set size is much larger than
    the available memory so that swapping is triggered. The memory read/write
    ratio is 80/20 and the accessing pattern is random. In the original
    implementation, the lock contention on the swap cache is heavy. The perf
    profiling data of the lock contention code path is as following,

    _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list: 7.91
    _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 7.11
    _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
    _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.29
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 0.93

    After applying this patch, it becomes,

    _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 2.3
    _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.8
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19

    The lock contention on the swap cache is almost eliminated.

    And the pmbench score increases 18.5%. The swapin throughput increases
    18.7% from 2.96 GB/s to 3.51 GB/s. While the swapout throughput increases
    18.5% from 2.99 GB/s to 3.54 GB/s.

    We need really fast disk to show the benefit. I have tried this on 2
    Intel P3600 NVMe disks. The performance improvement is only about 1%.
    The improvement should be better on the faster disks, such as Intel Optane
    disk.

    [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel]
    Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com
    [ying.huang@intel.com: v4]
    Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To improve the code readability and take advantage of the common
    implementation.

    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200512081013.520201-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • __swap_entry_free() always frees 1 entry. Let's remove the usage.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200501015259.32237-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Currently, the scalability of the swap code drops sharply when the swap
    device becomes fragmented, because the swap slot allocation batching stops
    working. To solve the problem, this patch tries to scan a few more swap
    slots, with strictly restricted effort, to batch the swap slot
    allocation even if the swap device is fragmented. Tests show that the
    benchmark score can increase by up to 37.1% with the patch. Details are
    as follows.

    The swap code has a per-cpu cache of swap slots. These batch swap space
    allocations to improve swap subsystem scaling. In the following code
    path,

    add_to_swap()
    get_swap_page()
    refill_swap_slots_cache()
    get_swap_pages()
    scan_swap_map_slots()

    scan_swap_map_slots() and get_swap_pages() can return multiple swap
    slots for each call. These slots will be cached in the per-CPU swap
    slots cache, so that several following swap slot requests will be
    fulfilled there to avoid the lock contention in the lower level swap
    space allocation/freeing code path.

    But this only works when there are free swap clusters. If a swap device
    becomes so fragmented that there are no free swap clusters,
    scan_swap_map_slots() and get_swap_pages() will return only one swap
    slot for each call in the above code path. Effectively, this falls back
    to the situation before the swap slots cache was introduced, where the
    heavy lock contention on the swap related locks kills the scalability.

    Why does it work this way? Because the swap device could be large, and
    the free swap slot scanning could be quite time consuming, the
    conservative method was used to avoid spending too much time scanning
    free swap slots.

    In fact, this can be improved by scanning a few more free slots with
    strictly restricted effort, which is what this patch implements. In
    scan_swap_map_slots(), after the first free swap slot is found, we will
    try to scan a little more, but only if we haven't scanned too many slots
    (< LATENCY_LIMIT). That is, the added scanning latency is strictly
    bounded.
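
    Schematically (a hedged sketch, not the real loop; slots and scanned are
    illustrative):

    /* keep scanning past the first hit, but only within a strict budget */
    while (n_ret < nr && offset <= si->highest_bit &&
           scanned < LATENCY_LIMIT) {
        if (!si->swap_map[offset])
            slots[n_ret++] = swp_entry(si->type, offset);
        offset++;
        scanned++;
    }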

    To test the patch, we have run 16-process pmbench memory benchmark on a
    2-socket server machine with 48 cores. Multiple ram disks are
    configured as the swap devices. The pmbench working-set size is much
    larger than the available memory so that swapping is triggered. The
    memory read/write ratio is 80/20 and the accessing pattern is random, so
    the swap space becomes highly fragmented during the test. In the
    original implementation, the lock contention on swap related locks is
    very heavy. The perf profiling data of the lock contention code path is
    as following,

    _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap: 21.03
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.92
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.72
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 0.69

    While after applying this patch, it becomes,

    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 4.89
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 3.85
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.1
    _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88

    That is, the lock contention on the swap locks is eliminated.

    And the pmbench score increases 37.1%. The swapin throughput increases
    45.7% from 2.02 GB/s to 2.94 GB/s. While the swapout throughput increases
    45.3% from 2.04 GB/s to 2.97 GB/s.

    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Tim Chen
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • There are two duplicated pieces of code handling the case when there is no
    available swap entry. To avoid this, we can compare tmp and max first and
    let the second guard do its job.

    No functional change is expected.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-3-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • If tmp is greater than or equal to max, we would jump to new_cluster.

    Return true directly.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • It is not necessary to use the variable found_free to record the status;
    checking tmp against max is enough.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • scan_swap_map_slots() is only called by scan_swap_map() and
    get_swap_pages(). Both ensure nr would not exceed SWAP_BATCH, so the
    check in scan_swap_map_slots() is unnecessary.

    Just remove it.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200325220309.9803-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Use min3() to simplify the comparison and make it more self-explanatory.
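
    That is, roughly (illustrative only, with approximate variable names):

    /* before: chained clamping */
    n = min(n, (int)SWAP_BATCH);
    n = min(n, avail);

    /* after: one self-describing expression */
    n = min3(n, (int)SWAP_BATCH, avail);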

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200325220309.9803-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Now we can see there are redundant gotos for the SSD case. In these two
    places, we can just let the code fall through to the correct tag instead
    of explicitly jumping to it.

    Let's remove them for better readability.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-4-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The code shows that if this is an SSD, it will jump to a specific tag and
    skip the following non-SSD code.

    Let's use "else if" to explicitly show the mutual exclusion of the
    SSD and non-SSD cases and reduce ambiguity.
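
    In outline, the resulting shape is something like this (a sketch, not the
    literal code):

    if (si->cluster_info) {
        /* SSD: allocate via the per-CPU cluster machinery */
        if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
            goto scan;
    } else if (unlikely(!si->cluster_nr--)) {
        /* HDD only: legacy sequential-cluster bookkeeping */
        si->cluster_nr = SWAPFILE_CLUSTER - 1;
    }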

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-3-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • scan_swap_map_slots() is used to iterate over the swap_map[] array for an
    available swap entry. After several optimizations, e.g. for the SSD
    case, the logic of this function has become a little hard to follow.

    This patchset tries to clean up the logic a little:

    * show that the SSD/non-SSD cases are handled mutually exclusively
    * remove some unnecessary gotos for the SSD case

    This patch (of 3):

    When si->cluster_nr is zero, the function would reach done and return. The
    increased offset would not be used any more. This means we can move the
    offset increment into the if clause.

    This brings a further code cleanup possibility.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-1-richard.weiyang@gmail.com
    Link: http://lkml.kernel.org/r/20200328060520.31449-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • In unuse_pte_range() we blindly swap in pages without checking if the
    swap entry is already present in the swap cache.

    Because of this, the hit/miss ratio used by the swap readahead heuristic
    is not properly updated, and this leads to non-optimal performance during
    swapoff.

    Tracing the distribution of the readahead size returned by the swap
    readahead heuristic during swapoff shows that a small readahead size is
    used most of the time as if we had only misses (this happens both with
    cluster and vma readahead), for example:

    r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
    COUNT EVENT
    36948 $retval = 8
    44151 $retval = 4
    49290 $retval = 1
    527771 $retval = 2

    Checking if the swap entry is present in the swap cache instead allows
    the readahead statistics to be properly updated, and the heuristic behaves
    better during swapoff, selecting a bigger readahead size:
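
    The change boils down to something like this (a hedged sketch):

    /* consult the swap cache first so the readahead hit/miss stats see it */
    page = lookup_swap_cache(entry, vma, addr);
    if (!page) {
        /* genuine miss: fall back to readahead-driven swapin */
        page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
    }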

    r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
    COUNT EVENT
    1618 $retval = 1
    4960 $retval = 2
    41315 $retval = 4
    103521 $retval = 8

    In terms of swapoff performance the result is the following:

    Testing environment
    ===================

    - Host:
    CPU: 1.8GHz Intel Core i7-8565U (quad-core, 8MB cache)
    HDD: PC401 NVMe SK hynix 512GB
    MEM: 16GB

    - Guest (kvm):
    8GB of RAM
    virtio block driver
    16GB swap file on ext4 (/swapfile)

    Test case
    =========
    - allocate 85% of memory
    - `systemctl hibernate` to force all the pages to be swapped-out to the
    swap file
    - resume the system
    - measure the time that swapoff takes to complete:
    # /usr/bin/time swapoff /swapfile

    Result (swapoff time)
    =====================

                          5.6 vanilla    5.6 w/ this patch
                          -----------    -----------------
    cluster-readahead        22.09s           12.19s
    vma-readahead            18.20s           15.33s

    Conclusion
    ==========

    The specific use case this patch is addressing is to improve swapoff
    performance in cloud environments when a VM has been hibernated, resumed
    and all the memory needs to be forced back to RAM by disabling swap.

    This change better exploits the advantages of the readahead heuristic
    during swapoff, and this improvement speeds up the resume process of
    such VMs.

    [andrea.righi@canonical.com: update changelog]
    Link: http://lkml.kernel.org/r/20200418084705.GA147642@xps-13
    Signed-off-by: Andrea Righi
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Minchan Kim
    Cc: Anchal Agarwal
    Cc: Hugh Dickins
    Cc: Vineeth Remanan Pillai
    Cc: Kelley Nielsen
    Link: http://lkml.kernel.org/r/20200416180132.GB3352@xps-13
    Signed-off-by: Linus Torvalds

    Andrea Righi