07 Dec, 2020

1 commit

  • We can't call kvfree() with a spin lock held, so defer it. Fixes a
    might_sleep() runtime warning.
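
    As an illustration only (not the patch itself; names are simplified), the
    usual pattern is to detach the pointer under the lock and only call
    kvfree() after dropping it:

    void *map;

    spin_lock(&si->lock);
    map = si->some_map;          /* detach under the lock */
    si->some_map = NULL;
    spin_unlock(&si->lock);

    kvfree(map);                 /* may sleep, so it must run unlocked */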

    Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc:
    Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
    Signed-off-by: Linus Torvalds

    Qian Cai
     

14 Oct, 2020

5 commits

  • If we failed to drain the inode, we would forget to free the swap address
    space allocated by init_swap_address_space() above.
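
    Purely as an illustrative sketch of the pattern (simplified, not the
    actual diff; the failure condition below is a placeholder): every init
    step that already succeeded needs a matching teardown on later failure
    branches:

    error = init_swap_address_space(p->type, maxpages);
    if (error)
        goto bad_swap_unlock_inode;

    /* ... */

    if (later_step_failed) {                /* placeholder condition */
        exit_swap_address_space(p->type);   /* the teardown this fix adds */
        goto bad_swap_unlock_inode;
    }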

    Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Darrick J. Wong
    Link: https://lkml.kernel.org/r/20200930101803.53884-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • It's unnecessary to goto the out label when the out label is just below.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200930102549.1885-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • We don't initially add anon pages to active lruvec after commit
    b518154e59aa ("mm/vmscan: protect the workingset on anonymous LRU").
    Remove activate_page() from unuse_pte(), which seems to have been missed
    by that commit. And make the function static while we are at it.

    Before the commit, we called lru_cache_add_active_or_unevictable() to add
    new ksm pages to active lruvec. Therefore, activate_page() wasn't
    necessary for them in the first place.

    Signed-off-by: Yu Zhao
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Cc: Alexander Duyck
    Cc: Huang Ying
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Mel Gorman
    Cc: Nicholas Piggin
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • SWP_FS is used to make swap_{read,write}page() go through the filesystem,
    and it's only used for swap files over NFS for now. Otherwise IO is
    submitted directly to the block device according to the swapfile extents
    reported by filesystems in advance.

    As Matthew pointed out [1], the SWP_FS naming is somewhat confusing, so
    let's rename it to SWP_FS_OPS.

    [1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org
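
    For context, a hedged sketch (not taken from the patch; the helper names
    below are hypothetical) of how the flag selects the IO path in
    swap_readpage()/swap_writepage():

    struct swap_info_struct *sis = page_swap_info(page);

    if (sis->flags & SWP_FS_OPS) {
        /* swap file over NFS: hand the IO to the filesystem */
        return swap_readpage_fs(page);      /* hypothetical helper */
    }
    /* otherwise build a bio against the extents and submit to the blockdev */
    return swap_readpage_bdev(page);        /* hypothetical helper */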

    Suggested-by: Matthew Wilcox
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "There's quite a lot of code here, but much of it is due to the
    addition of a new PMU driver as well as some arm64-specific selftests
    which is an area where we've traditionally been lagging a bit.

    In terms of exciting features, this includes support for the Memory
    Tagging Extension which narrowly missed 5.9, hopefully allowing
    userspace to run with use-after-free detection in production on CPUs
    that support it. Work is ongoing to integrate the feature with KASAN
    for 5.11.

    Another change that I'm excited about (assuming they get the hardware
    right) is preparing the ASID allocator for sharing the CPU page-table
    with the SMMU. Those changes will also come in via Joerg with the
    IOMMU pull.

    We do stray outside of our usual directories in a few places, mostly
    due to core changes required by MTE. Although much of this has been
    Acked, there were a couple of places where we unfortunately didn't get
    any review feedback.

    Other than that, we ran into a handful of minor conflicts in -next,
    but nothing that should pose any issues.

    Summary:

    - Userspace support for the Memory Tagging Extension introduced by
    Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

    - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
    switching.

    - Fix and subsequent rewrite of our Spectre mitigations, including
    the addition of support for PR_SPEC_DISABLE_NOEXEC.

    - Support for the Armv8.3 Pointer Authentication enhancements.

    - Support for ASID pinning, which is required when sharing
    page-tables with the SMMU.

    - MM updates, including treating flush_tlb_fix_spurious_fault() as a
    no-op.

    - Perf/PMU driver updates, including addition of the ARM CMN PMU
    driver and also support to handle CPU PMU IRQs as NMIs.

    - Allow prefetchable PCI BARs to be exposed to userspace using normal
    non-cacheable mappings.

    - Implementation of ARCH_STACKWALK for unwinding.

    - Improve reporting of unexpected kernel traps due to BPF JIT
    failure.

    - Improve robustness of user-visible HWCAP strings and their
    corresponding numerical constants.

    - Removal of TEXT_OFFSET.

    - Removal of some unused functions, parameters and prototypes.

    - Removal of MPIDR-based topology detection in favour of firmware
    description.

    - Cleanups to handling of SVE and FPSIMD register state in
    preparation for potential future optimisation of handling across
    syscalls.

    - Cleanups to the SDEI driver in preparation for support in KVM.

    - Miscellaneous cleanups and refactoring work"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    Revert "arm64: initialize per-cpu offsets earlier"
    arm64: random: Remove no longer needed prototypes
    arm64: initialize per-cpu offsets earlier
    kselftest/arm64: Check mte tagged user address in kernel
    kselftest/arm64: Verify KSM page merge for MTE pages
    kselftest/arm64: Verify all different mmap MTE options
    kselftest/arm64: Check forked child mte memory accessibility
    kselftest/arm64: Verify mte tag inclusion via prctl
    kselftest/arm64: Add utilities and a test to validate mte memory
    perf: arm-cmn: Fix conversion specifiers for node type
    perf: arm-cmn: Fix unsigned comparison to less than zero
    arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
    arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
    arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
    arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
    KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
    arm64: Get rid of arm64_ssbd_state
    KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
    KVM: arm64: Get rid of kvm_arm_have_ssbd()
    KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
    ...

    Linus Torvalds
     

27 Sep, 2020

1 commit

  • SWP_FS is used to make swap_{read,write}page() go through the
    filesystem, and it's only used for swap files over NFS. So !SWP_FS
    means non-NFS for now; it could be either file-backed or device-backed.
    Something similar goes for the legacy SWP_FILE.

    So in order to achieve the goal of the original patch, SWP_BLKDEV should
    be used instead.
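
    Roughly along these lines (a hedged sketch; the exact context may differ
    from the literal hunk):

    /* before: matches any non-NFS swap, including file-backed swapfiles */
    if (!(si->flags & SWP_FS))
        n_ret = swap_alloc_cluster(si, swp_entries);

    /* after: only block-device backed swap takes the huge-cluster path */
    if (si->flags & SWP_BLKDEV)
        n_ret = swap_alloc_cluster(si, swp_entries);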

    FS corruption can be observed with SSD device + XFS + fragmented
    swapfile due to CONFIG_THP_SWAP=y.

    I reproduced the issue with the following details:

    Environment:

    QEMU + upstream kernel + buildroot + NVMe (2 GB)

    Kernel config:

    CONFIG_BLK_DEV_NVME=y
    CONFIG_THP_SWAP=y

    Some reproducible steps:

    mkfs.xfs -f /dev/nvme0n1
    mkdir /tmp/mnt
    mount /dev/nvme0n1 /tmp/mnt
    bs="32k"
    sz="1024m" # doesn't matter too much, I also tried 16m
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

    mkswap /tmp/mnt/sw
    swapon /tmp/mnt/sw

    stress --vm 2 --vm-bytes 600M # doesn't matter too much as well

    Symptoms:
    - FS corruption (e.g. checksum failure)
    - memory corruption at: 0xd2808010
    - segfault

    Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
    Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Reviewed-by: Yang Shi
    Acked-by: Rafael Aquini
    Cc: Matthew Wilcox
    Cc: Carlos Maiolino
    Cc: Eric Sandeen
    Cc: Dave Chinner
    Cc:
    Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     

25 Sep, 2020

2 commits

  • The BDI_CAP_STABLE_WRITES flag is one of the few bits of information in the
    backing_dev_info shared between the block drivers and the writeback code.
    To help untangle the dependency, replace it with a queue flag and a
    superblock flag derived from it. This also helps with the case of e.g.
    a file system requiring stable writes due to its own checksumming, but
    not forcing it on other users of the block device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute which
    is also writable for easier testing.
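
    As a rough sketch of the resulting interface (hedged; the flag and helper
    names are my reading of this change, so verify against the tree):

    /* driver side: request stable pages for this queue */
    blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);

    /* consumer side: decide whether writeback must wait for stable pages */
    if (blk_queue_stable_writes(q))
        wait_for_stable_page(page);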

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
    decide if ->rw_page can be used on a block device. Just check for
    the method instead. The only complication is that zram needs a second
    set of block_device_operations, as it can switch between modes that
    actually support ->rw_page and those that don't.
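
    Roughly (an illustrative sketch, not the exact diff):

    /* old: consult the backing_dev_info capability bit */
    if (bdi_cap_synchronous_io(inode_to_bdi(swap_file->f_mapping->host)))
        p->flags |= SWP_SYNCHRONOUS_IO;

    /* new: just look for the method on the underlying disk */
    if (p->bdev && p->bdev->bd_disk->fops->rw_page)
        p->flags |= SWP_SYNCHRONOUS_IO;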

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2020

2 commits

  • swap_type_of is used for two entirely different purposes:

    (1) check what swap type a given device/offset corresponds to
    (2) find the first available swap device that can be written to

    Mixing both in a single function creates an unreadable mess. Create two
    separate functions instead, and switch both to pass a dev_t instead of
    a struct block_device to further simplify the code.
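
    A hedged sketch of the resulting split (signatures approximate, from
    memory of this series):

    /* (1) map a specific device/offset to a swap type, or an error */
    int swap_type_of(dev_t device, sector_t offset);

    /* (2) pick the first writable swap device and report its dev_t */
    int find_first_swap(dev_t *device);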

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Use blkdev_get_by_dev instead of bdgrab + blkdev_get.
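
    Roughly (a sketch, not the literal hunk; the holder argument is
    illustrative):

    /* before: two steps on an already looked-up block device */
    bdev = bdgrab(I_BDEV(inode));
    error = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, si);

    /* after: open straight from the dev_t in one call */
    bdev = blkdev_get_by_dev(inode->i_rdev,
                             FMODE_READ | FMODE_WRITE | FMODE_EXCL, si);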

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Sep, 2020

1 commit

  • Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
    every physical page. When swapping pages out to disk it is necessary to
    save these tags, and later restore them when reading the pages back.

    Add some hooks along with dummy implementations to enable the
    arch code to handle this.

    Three new hooks are added to the swap code:
    * arch_prepare_to_swap()
    * arch_swap_invalidate_page() / arch_swap_invalidate_area()
    One new hook is added to shmem:
    * arch_swap_restore()
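
    For reference, the generic dummy implementations are essentially no-ops
    (a paraphrased sketch; see the patch for the exact stubs):

    static inline int arch_prepare_to_swap(struct page *page)
    {
        return 0;               /* nothing to save by default */
    }

    static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
    {
    }

    static inline void arch_swap_invalidate_area(int type)
    {
    }

    static inline void arch_swap_restore(swp_entry_t entry, struct page *page)
    {
    }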

    Signed-off-by: Steven Price
    [catalin.marinas@arm.com: add unlock_page() on the error path]
    [catalin.marinas@arm.com: dropped the _tags suffix]
    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Steven Price
     

15 Aug, 2020

2 commits

  • As noticed by KCSAN, the swap_info_struct fields si.highest_bit,
    si.swap_map[offset] and si.flags could each be accessed concurrently:

    === si.highest_bit ===

    write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
    swap_range_alloc+0x81/0x130
    swap_range_alloc at mm/swapfile.c:681
    scan_swap_map_slots+0x371/0xb90
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
    scan_swap_map_slots+0x4a6/0xb90
    scan_swap_map_slots at mm/swapfile.c:892
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    === si.swap_map[offset] ===

    write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
    __swap_entry_free_locked+0x8c/0x100
    __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
    __swap_entry_free.constprop.20+0x69/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
    _swap_info_get+0x81/0xa0
    _swap_info_get at mm/swapfile.c:1140
    free_swap_and_cache+0x40/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    === si.flags ===

    write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
    _swap_info_get+0x41/0xa0
    __swap_info_get at mm/swapfile.c:1114
    put_swap_page+0x84/0x490
    __remove_mapping+0x384/0x5f0
    shrink_page_list+0xff1/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    The writes are under si->lock but the reads are not. For si.highest_bit
    and si.swap_map[offset], the data races could trigger logic bugs, so fix
    them by using WRITE_ONCE() for the writes and READ_ONCE() for the reads,
    except for those isolated reads that only compare against zero, where a
    data race would cause no harm. Annotate those as intentional data races
    using the data_race() macro.

    For si.flags, the readers are only interested in a single bit, so a data
    race there causes no issue.
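
    Schematically (a simplified sketch, not the actual hunks; the label is
    hypothetical):

    /* writer, under si->lock */
    WRITE_ONCE(si->highest_bit, new_highest);

    /* lockless reader whose result feeds real logic */
    if (offset > READ_ONCE(si->highest_bit))
        goto no_page;                       /* hypothetical label */

    /* lockless reader that only compares against zero: harmless race */
    if (data_race(si->swap_map[offset]))
        continue;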

    [cai@lca.pw: add a missing annotation for si->flags in memory.c]
    Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

2 commits

  • Workingset detection for anonymous pages will be implemented in the
    following patch, and it requires storing shadow entries in the
    swapcache. This patch implements the infrastructure to store the shadow
    entry in the swapcache.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In the current implementation, newly created or swapped-in anonymous pages
    start on the active list. Growing the active list results in rebalancing
    the active/inactive lists, so old pages on the active list are demoted to
    the inactive list. Hence, a page on the active list isn't protected at all.

    Following is an example of this situation.

    Assume there are 50 hot pages on the active list. Numbers denote the
    number of pages on the active/inactive list (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. As with the file LRU, newly created or
    swapped-in anonymous pages will be inserted onto the inactive list. They
    are promoted to the active list if enough references happen. This simple
    modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on active list would be protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the size
    of the inactive list but less than the size of the total (active +
    inactive) list. To solve this potential issue, the following patch will
    apply workingset detection similar to the one that's already applied to
    the file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Jun, 2020

2 commits

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
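
    In effect, each call site changes like this (an illustrative example, not
    a specific hunk from the series):

    /* before */
    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);
    up_read(&mm->mmap_sem);

    /* after */
    mmap_read_lock(mm);
    vma = find_vma(mm, addr);
    mmap_read_unlock(mm);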

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
    return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
    return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <linux/mm.h>") ; do
    sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

5 commits

  • Right now, users that are otherwise memory controlled can easily escape
    their containment and allocate significant amounts of memory that they're
    not being charged for. That's because swap readahead pages are not being
    charged until somebody actually faults them into their page table. This
    can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
    allocations without charging the pages.

    There are additional problems with the delayed charging of swap pages:

    1. To implement refault/workingset detection for anonymous pages, we
    need to have a target LRU available at swapin time, but the LRU is not
    determinable until the page has been charged.

    2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
    stable when the page is isolated from the LRU; otherwise, the locks
    change under us. But swapcache gets charged after it's already on the
    LRU, and even if we cannot isolate it ourselves (since charging is not
    exactly optional).

    The previous patch ensured we always maintain cgroup ownership records for
    swap pages. This patch moves the swapcache charging point from the fault
    handler to swapin time to fix all of the above problems.

    v2: simplify swapin error checking (Joonsoo)

    [hughd@google.com: fix livelock in __read_swap_cache_async()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Rafael Aquini
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the page->mapping requirement gone from memcg, we can charge anon and
    file-thp pages in one single step, right after they're allocated.

    This removes two out of three API calls - especially the tricky commit
    step that needed to happen at just the right time between when the page is
    "set up" and when it's "published" - somewhat vague and fluid concepts
    that varied by page type. All we need is a freshly allocated page and a
    memcg context to charge.

    v2: prevent double charges on pre-allocated hugepages in khugepaged

    [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
    Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains a private MEMCG_RSS counter. This divergence from the
    generic VM accounting means unnecessary code overhead, and creates a
    dependency for memcg that page->mapping is set up at the time of charging,
    so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
    lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
    same way we do for NR_FILE_MAPPED.

    With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
    counter, this patch finally eliminates the need to have page->mapping set
    up at charge time. However, we need to have page->mem_cgroup set up by
    the time rmap runs and does the accounting, so switch the commit and the
    rmap callbacks around.

    v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The cgroup swaprate throttling is about matching new anon allocations to
    the rate of available IO when that is being throttled. It's the io
    controller hooking into the VM, rather than a memory controller thing.

    Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and
    drop the @memcg argument which is only used to check whether the preceding
    page charge has succeeded and the fault is proceeding.

    We could decouple the call from mem_cgroup_try_charge() here as well, but
    that would cause unnecessary churn: the following patches convert all
    callsites to a new charge API and we'll decouple as we go along.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published and somebody may split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Jun, 2020

14 commits

  • Fix the heading and Size/Used/Priority field alignments in /proc/swaps.
    If the Size and/or Used value is >= 10000000 (8 bytes), then the
    alignment by using tab characters is broken.

    This patch maintains the use of tabs for alignment. If spaces are
    preferred, we can just use a Field Width specifier for the bytes and
    inuse fields. That way those fields don't have to be a multiple of 8
    bytes in width. E.g., with a field width of 12, both Size and Used
    would always fit on the first line of an 80-column wide terminal (only
    Priority would be on the second line).

    There are actually 2 problems: heading alignment and field width. On an
    xterm, if Used is 7 bytes in length, the tab does nothing, and the
    display is like this, with no space/tab between the Used and Priority
    fields. (ugh)

    Filename Type Size Used Priority
    /dev/sda8 partition 16779260 2023012-1

    To be clear, if one does 'cat /proc/swaps >/tmp/proc.swaps', it does look
    different, like so:

    Filename Type Size Used Priority
    /dev/sda8 partition 16779260 2086988 -1

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/c0ffb41a-81ac-ddfa-d452-a9229ecc0387@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • In some swap scalability tests, it was found that there is heavy lock
    contention on the swap cache even though we have split the single swap
    cache radix tree per swap device into one swap cache radix tree per 64 MB
    trunk in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

    The reason is as follows. After the swap device becomes fragmented so
    that there's no free swap cluster, the swap device will be scanned
    linearly to find free swap slots. swap_info_struct->cluster_next is
    the next scanning base, which is shared by all CPUs. So nearby free swap
    slots will be allocated to different CPUs. The probability of
    multiple CPUs operating on the same 64 MB trunk is high. This causes
    the lock contention on the swap cache.

    To solve the issue, this patch adds a per-CPU next scanning base
    (cluster_next_cpu) for SSD swap devices. Every CPU will use its
    own per-CPU next scanning base. And after finishing scanning a 64 MB
    trunk, the per-CPU scanning base will be changed to the beginning of
    another randomly selected 64 MB trunk. In this way, the probability of
    multiple CPUs operating on the same 64 MB trunk is greatly reduced, and
    thus the lock contention is reduced too. For HDD, because sequential
    access is more important for IO performance, the original shared next
    scanning base is kept.
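
    A simplified sketch of the idea (hedged; random_trunk_start() is a
    placeholder for the randomized trunk selection):

    /* SSD: every CPU scans from its own base; HDD keeps the shared one */
    if (si->flags & SWP_SOLIDSTATE)
        offset = this_cpu_read(*si->cluster_next_cpu);
    else
        offset = si->cluster_next;

    /* ... scan from offset; once the 64 MB trunk is exhausted, move only
       this CPU's base to the start of a randomly chosen trunk ... */
    this_cpu_write(*si->cluster_next_cpu, random_trunk_start(si));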

    To test the patch, we have run 16-process pmbench memory benchmark on a
    2-socket server machine with 48 cores. One ram disk is configured as the
    swap device per socket. The pmbench working-set size is much larger than
    the available memory so that swapping is triggered. The memory read/write
    ratio is 80/20 and the accessing pattern is random. In the original
    implementation, the lock contention on the swap cache is heavy. The perf
    profiling data of the lock contention code path is as following,

    _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list: 7.91
    _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 7.11
    _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
    _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.29
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 0.93

    After applying this patch, it becomes,

    _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 2.3
    _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.8
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19

    The lock contention on the swap cache is almost eliminated.

    And the pmbench score increases 18.5%. The swapin throughput increases
    18.7% from 2.96 GB/s to 3.51 GB/s. While the swapout throughput increases
    18.5% from 2.99 GB/s to 3.54 GB/s.

    We need really fast disk to show the benefit. I have tried this on 2
    Intel P3600 NVMe disks. The performance improvement is only about 1%.
    The improvement should be better on the faster disks, such as Intel Optane
    disk.

    [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel]
    Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com
    [ying.huang@intel.com: v4]
    Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To improve the code readability and take advantage of the common
    implementation.

    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200512081013.520201-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • __swap_entry_free() always frees 1 entry. Let's remove the usage.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200501015259.32237-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Currently, the scalability of the swap code drops sharply when the swap
    device becomes fragmented, because the swap slot allocation batching stops
    working. To solve the problem, this patch tries to scan a few more swap
    slots, with strictly restricted effort, to batch the swap slot
    allocation even if the swap device is fragmented. Tests show that the
    benchmark score can increase by up to 37.1% with the patch. Details are
    as follows.

    The swap code has a per-cpu cache of swap slots. These batch swap space
    allocations to improve swap subsystem scaling. In the following code
    path,

    add_to_swap()
    get_swap_page()
    refill_swap_slots_cache()
    get_swap_pages()
    scan_swap_map_slots()

    scan_swap_map_slots() and get_swap_pages() can return multiple swap
    slots for each call. These slots will be cached in the per-CPU swap
    slots cache, so that several following swap slot requests will be
    fulfilled there to avoid the lock contention in the lower level swap
    space allocation/freeing code path.

    But this only works when there are free swap clusters. If a swap device
    becomes so fragmented that there are no free swap clusters,
    scan_swap_map_slots() and get_swap_pages() will return only one swap
    slot for each call in the above code path. Effectively, this falls back
    to the situation before the swap slots cache was introduced, where the
    heavy lock contention on the swap related locks kills the scalability.

    Why does it work this way? Because the swap device could be large, and
    the free swap slot scanning could be quite time consuming, the
    conservative method was used to avoid spending too much time scanning
    free swap slots.

    In fact, this can be improved by scanning a few more free slots with
    strictly restricted effort, which is what this patch implements. In
    scan_swap_map_slots(), after the first free swap slot is found, we will
    try to scan a little more, but only if we haven't scanned too many slots
    (< LATENCY_LIMIT). That is, the added scanning latency is strictly
    bounded.
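
    Schematically (a hedged sketch, not the real loop; slots and scanned are
    illustrative):

    /* keep scanning past the first hit, but only within a strict budget */
    while (n_ret < nr && offset <= si->highest_bit &&
           scanned < LATENCY_LIMIT) {
        if (!si->swap_map[offset])
            slots[n_ret++] = swp_entry(si->type, offset);
        offset++;
        scanned++;
    }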

    To test the patch, we have run 16-process pmbench memory benchmark on a
    2-socket server machine with 48 cores. Multiple ram disks are
    configured as the swap devices. The pmbench working-set size is much
    larger than the available memory so that swapping is triggered. The
    memory read/write ratio is 80/20 and the accessing pattern is random, so
    the swap space becomes highly fragmented during the test. In the
    original implementation, the lock contention on swap related locks is
    very heavy. The perf profiling data of the lock contention code path is
    as following,

    _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap: 21.03
    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.92
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.72
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 0.69

    While after applying this patch, it becomes,

    _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 4.89
    _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 3.85
    _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.1
    _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88

    That is, the lock contention on the swap locks is eliminated.

    And the pmbench score increases 37.1%. The swapin throughput increases
    45.7% from 2.02 GB/s to 2.94 GB/s. While the swapout throughput increases
    45.3% from 2.04 GB/s to 2.97 GB/s.

    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Tim Chen
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • There are two duplicated pieces of code handling the case when there is no
    available swap entry. To avoid this, we can compare tmp and max first and
    let the second guard do its job.

    No functional change is expected.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-3-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • If tmp is greater than or equal to max, we would jump to new_cluster.

    Return true directly.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • It is not necessary to use the variable found_free to record the status;
    checking tmp against max is enough.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Tim Chen
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200421213824.8099-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • scan_swap_map_slots() is only called by scan_swap_map() and
    get_swap_pages(). Both ensure nr would not exceed SWAP_BATCH, so the
    check in scan_swap_map_slots() is unnecessary.

    Just remove it.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200325220309.9803-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Use min3() to simplify the comparison and make it more self-explanatory.
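
    That is, roughly (illustrative only, with approximate variable names):

    /* before: chained clamping */
    n = min(n, (int)SWAP_BATCH);
    n = min(n, avail);

    /* after: one self-describing expression */
    n = min3(n, (int)SWAP_BATCH, avail);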

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200325220309.9803-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Now we can see there are redundant gotos for the SSD case. In these two
    places, we can just let the code fall through to the correct tag instead
    of explicitly jumping to it.

    Let's remove them for better readability.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-4-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The code shows that if this is an SSD, it will jump to a specific tag and
    skip the following non-SSD code.

    Let's use "else if" to explicitly show the mutual exclusion of the
    SSD and non-SSD cases and reduce ambiguity.
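
    In outline, the resulting shape is something like this (a sketch, not the
    literal code):

    if (si->cluster_info) {
        /* SSD: allocate via the per-CPU cluster machinery */
        if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
            goto scan;
    } else if (unlikely(!si->cluster_nr--)) {
        /* HDD only: legacy sequential-cluster bookkeeping */
        si->cluster_nr = SWAPFILE_CLUSTER - 1;
    }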

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-3-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • scan_swap_map_slots() is used to iterate over the swap_map[] array for an
    available swap entry. After several optimizations, e.g. for the SSD
    case, the logic of this function has become a little hard to follow.

    This patchset tries to clean up the logic a little:

    * show that the SSD/non-SSD cases are handled mutually exclusively
    * remove some unnecessary gotos for the SSD case

    This patch (of 3):

    When si->cluster_nr is zero, the function would reach done and return. The
    increased offset would not be used any more. This means we can move the
    offset increment into the if clause.

    This brings a further code cleanup possibility.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/20200328060520.31449-1-richard.weiyang@gmail.com
    Link: http://lkml.kernel.org/r/20200328060520.31449-2-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • In unuse_pte_range() we blindly swap in pages without checking if the
    swap entry is already present in the swap cache.

    Because of this, the hit/miss ratio used by the swap readahead heuristic
    is not properly updated, and this leads to non-optimal performance during
    swapoff.

    Tracing the distribution of the readahead size returned by the swap
    readahead heuristic during swapoff shows that a small readahead size is
    used most of the time as if we had only misses (this happens both with
    cluster and vma readahead), for example:

    r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
    COUNT EVENT
    36948 $retval = 8
    44151 $retval = 4
    49290 $retval = 1
    527771 $retval = 2

    Checking if the swap entry is present in the swap cache instead allows
    the readahead statistics to be properly updated, and the heuristic behaves
    better during swapoff, selecting a bigger readahead size:
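
    The change boils down to something like this (a hedged sketch):

    /* consult the swap cache first so the readahead hit/miss stats see it */
    page = lookup_swap_cache(entry, vma, addr);
    if (!page) {
        /* genuine miss: fall back to readahead-driven swapin */
        page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
    }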

    r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
    COUNT EVENT
    1618 $retval = 1
    4960 $retval = 2
    41315 $retval = 4
    103521 $retval = 8

    In terms of swapoff performance the result is the following:

    Testing environment
    ===================

    - Host:
    CPU: 1.8GHz Intel Core i7-8565U (quad-core, 8MB cache)
    HDD: PC401 NVMe SK hynix 512GB
    MEM: 16GB

    - Guest (kvm):
    8GB of RAM
    virtio block driver
    16GB swap file on ext4 (/swapfile)

    Test case
    =========
    - allocate 85% of memory
    - `systemctl hibernate` to force all the pages to be swapped-out to the
    swap file
    - resume the system
    - measure the time that swapoff takes to complete:
    # /usr/bin/time swapoff /swapfile

    Result (swapoff time)
    =====================

                          5.6 vanilla    5.6 w/ this patch
                          -----------    -----------------
    cluster-readahead        22.09s           12.19s
    vma-readahead            18.20s           15.33s

    Conclusion
    ==========

    The specific use case this patch is addressing is to improve swapoff
    performance in cloud environments when a VM has been hibernated, resumed
    and all the memory needs to be forced back to RAM by disabling swap.

    This change better exploits the advantages of the readahead heuristic
    during swapoff, and this improvement speeds up the resume process of
    such VMs.

    [andrea.righi@canonical.com: update changelog]
    Link: http://lkml.kernel.org/r/20200418084705.GA147642@xps-13
    Signed-off-by: Andrea Righi
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Cc: Minchan Kim
    Cc: Anchal Agarwal
    Cc: Hugh Dickins
    Cc: Vineeth Remanan Pillai
    Cc: Kelley Nielsen
    Link: http://lkml.kernel.org/r/20200416180132.GB3352@xps-13
    Signed-off-by: Linus Torvalds

    Andrea Righi