26 Jan, 2019

1 commit

  • [ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ]

    Since a2468cc9bfdf ("swap: choose swap device according to numa node"),
    the avail_lists field of swap_info_struct has been an array with
    MAX_NUMNODES elements. This increased the size of swap_info_struct to
    40KiB, which requires an order-4 page to hold it.

    This is not optimal in that:
    1. Most systems have far fewer than MAX_NUMNODES (1024) nodes, so memory
    is wasted;
    2. It could cause swapon failure if the swap device is swapped on after
    the system has been running for a while, because no order-4 page is
    available, as pointed out by Vasily Averin.

    Solve both issues by sizing avail_lists with nr_node_ids (the actual
    number of possible nodes on the running system) instead of MAX_NUMNODES.

    nr_node_ids is unknown at compile time, so it can't be used directly when
    declaring this array. Instead, declare avail_lists as a zero-element
    array and allocate space for it together with swap_info_struct. An array
    is kept rather than a pointer because plist_for_each_entry needs the
    field to be part of the struct; a pointer will not work.
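
    A minimal userspace sketch of that layout, with stand-in types (the list
    head is only a placeholder, and calloc() stands in for the kernel's
    kvzalloc()):

    #include <stdlib.h>

    struct plist_head_sketch { void *prev, *next; };   /* placeholder type */

    struct swap_info_struct_sketch {
            /* ... other fields ... */
            int prio;
            struct plist_head_sketch avail_lists[];    /* flexible array member */
    };

    static struct swap_info_struct_sketch *alloc_swap_info_sketch(unsigned int nr_node_ids)
    {
            /* one allocation covers the struct plus nr_node_ids list heads;
             * the kernel uses kvzalloc() here in case nr_node_ids is large */
            return calloc(1, sizeof(struct swap_info_struct_sketch) +
                             nr_node_ids * sizeof(struct plist_head_sketch));
    }

    int main(void)
    {
            struct swap_info_struct_sketch *si = alloc_swap_info_sketch(4);
            free(si);
            return si == NULL;
    }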

    This patch is on top of Vasily Averin's fix commit. I think the use of
    kvzalloc for swap_info_struct is still needed in case nr_node_ids is
    really big on some systems.

    Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
    Signed-off-by: Aaron Lu
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vasily Averin
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Aaron Lu
     

13 Jan, 2019

1 commit

  • commit 7af7a8e19f0c5425ff639b0f0d2d244c2a647724 upstream.

    KSM pages may be mapped to multiple VMAs that cannot be reached from one
    anon_vma. So during swapin, a new copy of the page needs to be generated
    if a different anon_vma is needed; please refer to the comments of
    ksm_might_need_to_copy() for details.

    During swapoff, unuse_vma() uses the anon_vma (if available) to locate the
    VMA and virtual address mapped to the page, so not all mappings of a
    swapped-out KSM page can be found. Therefore, in try_to_unuse(), even if
    the swap count of a swap entry isn't zero, the page needs to be deleted
    from the swap cache, so that in the next round a new page can be allocated
    and swapped in for the other mappings of the swapped-out KSM page.

    But this conflicts with THP swap support, where a THP may be deleted from
    the swap cache only after the swap count of every swap entry in the huge
    swap cluster backing it has reached 0. So try_to_unuse() was changed in
    commit e07098294adf ("mm, THP, swap: support to reclaim swap space for
    THP swapped out") to check that before deleting a page from the swap
    cache, but this broke KSM swapoff too.

    Fortunately, KSM works on normal pages only, so the original behavior for
    KSM pages can be restored easily by checking PageTransCompound(). That is
    how this patch works.
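
    A hedged userspace sketch of the decision described above (the predicates
    and fields are stand-ins; the real check lives in try_to_unuse()):

    #include <stdbool.h>
    #include <stdio.h>

    struct page_sketch {
            bool in_swap_cache;
            bool trans_compound;        /* part of a THP */
            int  cluster_swap_count;    /* remaining swap refs in the huge cluster */
    };

    /* A normal (possibly KSM) page may be dropped from the swap cache during
     * swapoff even with a nonzero swap count, so its other mappings get a
     * fresh swapin later; a THP must wait until its whole cluster is unused. */
    static bool can_delete_from_swap_cache(const struct page_sketch *p)
    {
            if (!p->in_swap_cache)
                    return false;
            return !p->trans_compound || p->cluster_swap_count == 0;
    }

    int main(void)
    {
            struct page_sketch ksm_page = { true, false, 3 };
            struct page_sketch thp_page = { true, true, 3 };

            printf("KSM page: %d, THP: %d\n",
                   can_delete_from_swap_cache(&ksm_page),
                   can_delete_from_swap_cache(&thp_page));
            return 0;
    }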

    The bug was introduced by e07098294adf ("mm, THP, swap: support to reclaim
    swap space for THP swapped out"), which was merged in v4.14-rc1, so I
    think we should backport the fix to 4.14 and later. But Hugh thinks it
    may be rare for KSM pages to be in the swap device at swapoff time, which
    is why nobody has reported the bug so far.

    Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
    Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
    Signed-off-by: "Huang, Ying"
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Daniel Jordan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     

21 Nov, 2018

1 commit

  • commit 873d7bcfd066663e3e50113dc4a0de19289b6354 upstream.

    Commit a2468cc9bfdf ("swap: choose swap device according to numa node")
    changed the 'avail_lists' field of 'struct swap_info_struct' to an array.
    In popular Linux distros this increased the size of swap_info_struct to
    roughly 40 Kbytes, so allocating it now requires an order-4 page.
    Switching to kvzalloc() avoids unexpected allocation failures.
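
    Purely illustrative arithmetic for the numbers above (a 4 KiB page size
    is assumed):

    #include <stdio.h>

    int main(void)
    {
            unsigned long size = 40UL * 1024;      /* ~40 KiB swap_info_struct */
            unsigned long page = 4096;             /* assumed page size */
            unsigned long pages = (size + page - 1) / page;
            unsigned int order = 0;

            while ((1UL << order) < pages)         /* smallest power-of-two page count */
                    order++;
            printf("needs an order-%u allocation (%lu pages)\n", order, 1UL << order);
            return 0;
    }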

    Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
    Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node")
    Signed-off-by: Vasily Averin
    Acked-by: Aaron Lu
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Huang Ying
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     

16 Aug, 2018

1 commit

  • commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb upstream

    For the L1TF workaround it's necessary to limit the swap file size to
    below MAX_PA/2, so that the inverted high bits of the swap offset never
    point to valid memory.

    Add a mechanism for the architecture to override the swap file size check
    in swapfile.c and add an x86-specific max swapfile check function that
    enforces that limit.

    The check is only enabled if the CPU is vulnerable to L1TF.

    In VMs with a 42-bit MAX_PA the typical limit is now 2TB; on a native
    system with a 46-bit PA it is 32TB. The limit applies per individual swap
    file, so it's still possible to exceed it with multiple swap files or
    partitions.
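
    Illustrative arithmetic only (not the kernel's actual check): capping the
    swap offset at MAX_PA/2 reproduces the figures quoted above.

    #include <stdio.h>

    /* swap size cap in bytes for a given number of physical address bits */
    static unsigned long long l1tf_swap_limit(unsigned int max_pa_bits)
    {
            return 1ULL << (max_pa_bits - 1);      /* MAX_PA / 2 */
    }

    int main(void)
    {
            printf("42-bit MAX_PA (VM):     %llu TB\n", l1tf_swap_limit(42) >> 40);
            printf("46-bit MAX_PA (native): %llu TB\n", l1tf_swap_limit(46) >> 40);
            return 0;
    }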

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

30 May, 2018

1 commit

  • [ Upstream commit a06ad633a37c64a0cd4c229fc605cee8725d376e ]

    Calling swapon() on a zero-length swap file on SSD can lead to a
    divide-by-zero.

    Although creating such files isn't possible with mkswap and they would be
    considered invalid, it would be better for the swapon code to be more
    robust and handle this condition gracefully (return -EINVAL), especially
    since the fix is small and straightforward.

    To help with wear leveling on SSD, the swapon syscall calculates a
    random position in the swap file using modulo p->highest_bit, which is
    set to maxpages - 1 in read_swap_header.

    If the swap file is zero length, read_swap_header sets maxpages=1 and
    last_page=0, resulting in p->highest_bit=0, so the modulo by
    p->highest_bit in the swapon syscall divides by zero.

    This can be prevented by having read_swap_header return zero if
    last_page is zero.
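
    A hedged userspace sketch of the failure mode and the guard described
    above (heavily simplified; the real logic lives in read_swap_header()
    and the swapon syscall):

    #include <stdio.h>

    /* simplified stand-in for read_swap_header(): returns maxpages, 0 = invalid */
    static unsigned long read_swap_header_sketch(unsigned long last_page)
    {
            if (!last_page)               /* the added check: zero-length swap file */
                    return 0;             /* caller turns this into -EINVAL */
            return last_page + 1;
    }

    int main(void)
    {
            unsigned long maxpages = read_swap_header_sketch(0);

            if (!maxpages) {
                    fprintf(stderr, "swapon: invalid swap header\n");
                    return 1;
            }
            /* without the check: maxpages == 1, highest_bit == 0, division by zero */
            unsigned long highest_bit = maxpages - 1;
            printf("random start: %lu\n", 12345UL % highest_bit);
            return 0;
    }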

    Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
    Signed-off-by: Thomas Abraham
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tom Abraham
     

03 Nov, 2017

1 commit

  • One page may store a set of entries of sis->swap_map
    (swap_info_struct->swap_map) spanning multiple swap clusters.

    If some of the entries have sis->swap_map[offset] > SWAP_MAP_MAX,
    multiple pages are used to store the set of entries of sis->swap_map,
    and the pages are linked via page->lru. This is called swap count
    continuation. Previously, sis->lock was used to serialize access to the
    pages that store these entries. But to improve the scalability of
    __swap_duplicate(), the swap cluster lock may now be used in
    swap_count_continued(). This may race with add_swap_count_continuation()
    operating on a nearby swap cluster whose sis->swap_map entries are stored
    in the same page.

    In practice the race can corrupt the swap count, causing unfreeable swap
    entries, software lockups, etc.

    To fix the race, a new spinlock called cont_lock is added to struct
    swap_info_struct to protect the swap count continuation page list. This
    is a swap-device-level lock, so its scalability isn't great, but it is
    still much better than the original sis->lock because it is only
    acquired/released when swap count continuation is used, which is rare in
    practice. If scalability turns out to be an issue for some workloads, the
    lock can be split into finer-grained locks.
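
    A hedged userspace sketch of the structural change described above (a
    pthread mutex stands in for the kernel spinlock; names are illustrative):

    #include <pthread.h>
    #include <stddef.h>

    struct cont_page_sketch {                    /* one swap-count continuation page */
            unsigned char count[4096];
            struct cont_page_sketch *next;
    };

    struct swap_info_sketch {
            /* ... existing fields (swap_map, lock, ...) ... */
            pthread_mutex_t cont_lock;           /* new: protects the continuation list */
            struct cont_page_sketch *cont_pages;
    };

    /* add_swap_count_continuation() and swap_count_continued() both take
     * cont_lock, so they no longer race even when they touch continuation
     * entries that live in the same page but belong to different clusters. */
    static void add_continuation_sketch(struct swap_info_sketch *si,
                                        struct cont_page_sketch *page)
    {
            pthread_mutex_lock(&si->cont_lock);
            page->next = si->cont_pages;
            si->cont_pages = page;
            pthread_mutex_unlock(&si->cont_lock);
    }

    int main(void)
    {
            struct swap_info_sketch si = { .cont_lock = PTHREAD_MUTEX_INITIALIZER };
            struct cont_page_sketch page = { .next = NULL };

            add_continuation_sketch(&si, &page);
            return si.cont_pages != &page;
    }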

    Link: http://lkml.kernel.org/r/20171017081320.28133-1-ying.huang@intel.com
    Fixes: 235b62176712 ("mm/swap: add cluster lock")
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Aaron Lu
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: [4.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

09 Sep, 2017

2 commits

  • Free frontswap_map if an error is encountered before enable_swap_info().

    Signed-off-by: David Rientjes
    Reviewed-by: "Huang, Ying"
    Cc: Darrick J. Wong
    Cc: Hugh Dickins
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If initializing a small swap file fails because the swap file has a
    problem (holes, etc.) then we need to free the cluster info as part of
    cleanup. Unfortunately a previous patch changed the code to use kvzalloc
    but did not change all the vfree calls to use kvfree.

    Found by running generic/357 from xfstests.

    Link: http://lkml.kernel.org/r/20170831233515.GR3775@magnolia
    Fixes: 54f180d3c181 ("mm, swap: use kvzalloc to allocate some swap data structures")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: "Huang, Ying"
    Acked-by: David Rientjes
    Cc: Hugh Dickins
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

07 Sep, 2017

7 commits

  • If the system has more than one swap device and the swap devices have
    node information, we can use this information to decide which swap device
    to use in get_swap_pages() for better performance.

    The current code uses a priority-based list, swap_avail_list, to decide
    which swap device to use; if multiple swap devices share the same
    priority, they are used round robin. This patch changes the single global
    swap_avail_list into per-NUMA-node lists, i.e. each NUMA node sees its
    own priority-based list of available swap devices. A swap device's
    priority can be promoted on its matching node's swap_avail_list.

    A swap device's priority is currently set as follows: the user can set a
    value >= 0, or the system picks one starting from -1 and going downwards.
    The priority value stored in swap_avail_list is the negated device
    priority, because plists are sorted from low to high. The new policy
    doesn't change the semantics for the priority >= 0 case; the automatic
    priorities now start from -2 and go downwards, with -1 reserved as the
    promoted value (see the sketch after the example below).

    Take 4-node EX machine as an example, suppose 4 swap devices are
    available, each sit on a different node:
    swapA on node 0
    swapB on node 1
    swapC on node 2
    swapD on node 3

    After they are all swapped on, in the sequence A, B, C, D:

    Current behaviour:
    their priorities will be:
    swapA: -1
    swapB: -2
    swapC: -3
    swapD: -4
    And their position in the global swap_avail_list will be:
    swapA -> swapB -> swapC -> swapD
    prio:1 prio:2 prio:3 prio:4

    New behaviour:
    their priorities will be (note that -1 is skipped):
    swapA: -2
    swapB: -3
    swapC: -4
    swapD: -5
    And their positions in the 4 swap_avail_lists[nid] will be:
    swap_avail_lists[0]: /* node 0's available swap device list */
    swapA -> swapB -> swapC -> swapD
    prio:1 prio:3 prio:4 prio:5
    swap_avail_lists[1]: /* node 1's available swap device list */
    swapB -> swapA -> swapC -> swapD
    prio:1 prio:2 prio:4 prio:5
    swap_avail_lists[2]: /* node 2's available swap device list */
    swapC -> swapA -> swapB -> swapD
    prio:1 prio:2 prio:3 prio:5
    swap_avail_lists[3]: /* node 3's available swap device list */
    swapD -> swapA -> swapB -> swapC
    prio:1 prio:2 prio:3 prio:4
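
    A hedged sketch of the placement rule described above (not the kernel
    code): the auto-assigned negative priority is promoted to -1 on the
    device's own node, and the plist key is the negated priority so lower
    keys sort first. Running it reproduces the per-node priorities listed in
    the example.

    #include <stdio.h>

    static int avail_list_key(int prio, int device_node, int nid)
    {
            int effective = prio;

            if (prio < 0 && nid == device_node)
                    effective = -1;                /* promoted on its own node */
            return -effective;                     /* plists sort from low to high */
    }

    int main(void)
    {
            /* swapA..swapD on nodes 0..3, auto priorities -2..-5 as above */
            const char names[] = "ABCD";

            for (int nid = 0; nid < 4; nid++) {
                    printf("node %d:", nid);
                    for (int dev = 0; dev < 4; dev++)
                            printf("  swap%c prio:%d", names[dev],
                                   avail_list_key(-2 - dev, dev, nid));
                    printf("\n");
            }
            return 0;
    }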

    To see the effect of the patch, a test is used that starts N processes,
    each of which mmaps a region of anonymous memory and then continually
    writes to it at random positions to trigger both swap-in and swap-out.

    On a 2-node Skylake-EP machine with 64GiB of memory, two 170GB SSDs are
    used as swap devices, each attached to a different node; the results are:

    runtime=30m/processes=32/total test size=128G/each process mmap region=4G
    kernel throughput
    vanilla 13306
    auto-binding 15169 +14%

    runtime=30m/processes=64/total test size=128G/each process mmap region=2G
    kernel throughput
    vanilla 11885
    auto-binding 14879 +25%

    [aaron.lu@intel.com: v2]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    [akpm@linux-foundation.org: use kmalloc_array()]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    Signed-off-by: Aaron Lu
    Cc: "Chen, Tim C"
    Cc: Huang Ying
    Cc: Andi Kleen
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • VMA-based swap readahead reads ahead virtual pages that are contiguous in
    the virtual address space, while the original swap readahead reads ahead
    swap slots that are contiguous on the swap device. Although VMA-based
    swap readahead chooses the slots to read ahead more accurately, it
    triggers more small random reads, which may degrade HDD (hard disk)
    performance heavily and could ultimately outweigh the benefit.

    To avoid this, in this patch, if an HDD is used as swap, VMA-based swap
    readahead is disabled and the original swap readahead is used instead.

    Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • After adding swap-out support for THP (Transparent Huge Page), it is
    possible that a THP in the swap cache (partly swapped out) needs to be
    split. To split such a THP, the swap cluster backing the THP needs to be
    split too; that is, the CLUSTER_FLAG_HUGE flag needs to be cleared for
    the swap cluster. This patch implements that.

    Because writing out a THP requires that it stay a huge page during the
    write, the PageWriteback flag is checked before splitting.

    Link: http://lkml.kernel.org/r/20170724051840.2309-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • It's hard to write a whole transparent huge page (THP) to a file-backed
    swap device during swap-out, and file-backed swap devices aren't very
    popular. So huge cluster allocation is disabled for file-backed swap
    devices.

    Link: http://lkml.kernel.org/r/20170724051840.2309-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • With support for delaying THP (Transparent Huge Page) splitting until
    after swap-out, it is possible that some page table mappings of the THP
    have been turned into swap entries. So reuse_swap_page() needs to check
    the swap count in addition to the map count as before. This patch does
    that.

    In the huge PMD write-protect fault handler, the swap count needs to be
    checked in addition to the page map count, so the page lock needs to be
    acquired as well as the page table lock when calling reuse_swap_page().

    [ying.huang@intel.com: silence a compiler warning]
    Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Normal swap slot reclaim can happen when the swap count reaches
    SWAP_HAS_CACHE. But for a swap slot backing a THP, all swap slots backing
    that THP must be reclaimed together, because the slots may be used again
    when the THP is swapped out again later. So the swap slots backing one
    THP can be reclaimed together only when the swap count of every one of
    them has reached SWAP_HAS_CACHE. This patch implements the functions that
    check whether the swap count for all swap slots backing one THP has
    reached SWAP_HAS_CACHE and uses them when deciding whether a swap slot
    can be reclaimed.
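
    A hedged userspace sketch of that check (the SWAP_HAS_CACHE value and the
    512-slot cluster size match x86_64, but treat both as illustrative):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define SWAP_HAS_CACHE_SKETCH   0x40    /* "only the swap cache holds it" */
    #define THP_SWAP_SLOTS          512     /* slots backing one THP on x86_64 */

    /* every slot backing the THP must be down to the cache-only reference
     * before the whole huge cluster may be reclaimed */
    static bool thp_cluster_reclaimable(const unsigned char *swap_map,
                                        unsigned long start)
    {
            for (unsigned long i = 0; i < THP_SWAP_SLOTS; i++)
                    if (swap_map[start + i] != SWAP_HAS_CACHE_SKETCH)
                            return false;
            return true;
    }

    int main(void)
    {
            unsigned char swap_map[THP_SWAP_SLOTS];

            memset(swap_map, SWAP_HAS_CACHE_SKETCH, sizeof(swap_map));
            printf("reclaimable: %d\n", thp_cluster_reclaimable(swap_map, 0));
            swap_map[7] = SWAP_HAS_CACHE_SKETCH | 1;   /* one slot still referenced */
            printf("reclaimable: %d\n", thp_cluster_reclaimable(swap_map, 0));
            return 0;
    }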

    To make it easier to determine whether a swap slot is backing a THP, a
    new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
    cluster that is backing a THP (Transparent Huge Page). Because swapping a
    THP in as a whole isn't supported yet, the CLUSTER_FLAG_HUGE flag is
    cleared after the THP is deleted from the swap cache (for example, when
    swap-out finishes), so that the normal pages inside the THP can be
    swapped in individually.

    [ying.huang@intel.com: fix swap_page_trans_huge_swapped on HDD]
    Link: http://lkml.kernel.org/r/874ltsm0bi.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20170724051840.2309-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "mm, THP, swap: Delay splitting THP after swapped out", v3.

    This is the second step of the THP (Transparent Huge Page) swap
    optimization. In the first step, splitting the huge page was delayed from
    almost the beginning of swap-out to after allocating the swap space for
    the THP and adding it to the swap cache. In the second step, the
    splitting is delayed further, to after the swap-out has finished. The
    plan is to delay splitting the THP step by step and finally avoid
    splitting it altogether, swapping the THP out and in as a whole.

    In the patchset, more operations of anonymous THP reclaim, such as TLB
    flushing, writing the THP to the swap device, and removing the THP from
    the swap cache, are batched, so the performance of anonymous THP swap-out
    is improved.

    During development, the following scenarios/code paths have been checked:

    - swap out/in
    - swap off
    - write protect page fault
    - madvise_free
    - process exit
    - split huge page

    With the patchset, swap-out throughput improves 42% (from about 5.81GB/s
    to about 8.25GB/s) in the vm-scalability swap-w-seq test case with 16
    processes. At the same time, the IPI count (reflecting TLB flushing) is
    reduced by about 78.9%. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM-simulated PMEM (persistent memory) device. To test
    sequential swap-out, the test case creates 8 processes, which
    sequentially allocate and write to anonymous pages until the RAM and part
    of the swap device are used up.

    Below is the part of the cover letter for the first step patchset of THP
    swap optimization which applies to all steps.

    =========================

    Recently, the performance of storage devices has improved so fast that we
    cannot saturate the disk bandwidth with a single logical CPU when
    swapping out pages, even on a high-end server machine, because storage
    device performance has improved faster than that of a single logical CPU,
    and it seems that this trend will not change in the near future. On the
    other hand, THP is becoming more and more popular because of increased
    memory sizes. So it has become necessary to optimize THP swap
    performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce TLB flushing and lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for the swap read, which is usually 4k random
    IO. This will improve the performance of the THP swap too.

    - It will help with memory fragmentation, especially when THP is
    heavily used by applications. The 2M contiguous pages will be freed up
    after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because khugepaged collapses normal pages into a THP quite slowly. If
    the THP is split during swap-out, it takes quite a long time for the
    normal pages to collapse back into a THP after being swapped in. High
    THP utilization also helps the efficiency of page-based memory
    management.

    There are some concerns regarding THP swap-in, mainly because the
    possibly enlarged read/write IO size (for swap in/out) may put more
    overhead on the storage device. To deal with that, THP swap-in should be
    turned on only when necessary.

    For example, it can be selected via "always/never/madvise" logic, to be
    turned on globally, turned off globally, or turned on only for VMAs with
    MADV_HUGEPAGE, etc.

    This patch (of 12):

    Previously, swapcache_free_cluster() was used only in the error path of
    shrink_page_list(), to free the swap cluster just allocated if splitting
    the THP (Transparent Huge Page) failed. In this patch, it is enhanced to
    clear the swap cache flag (SWAP_HAS_CACHE) for the swap cluster that
    holds the contents of a swapped-out THP.

    This will be used by the support for delaying THP splitting until after
    swap-out. Because swapping a THP in as a whole isn't supported yet, after
    clearing the swap cache flag the swap cluster backing the swapped-out THP
    is split, so that the swap slots in the cluster can be swapped in as
    normal pages later.

    Link: http://lkml.kernel.org/r/20170724051840.2309-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

11 Jul, 2017

1 commit

  • For fast flash disks, async IO can introduce overhead because of context
    switches. blk-mq now supports IO polling, which improves performance and
    latency a lot. Swap-in is a good place to use this technique, because the
    task is waiting for the swapped-in page in order to continue execution.

    In my virtual machine, directly reading 4k of data from an NVMe device
    with iopoll is about 60% faster than without polling. With iopoll support
    in the swap-in path, my microbenchmark (a task doing random memory
    writes) is about 10%~25% faster, though CPU utilization increases a lot,
    2x and even 3x. This will depend on disk speed.

    While iopoll in swap-in isn't intended for all use cases, it's a win for
    latency-sensitive workloads with a high-speed swap disk. The block layer
    has a knob to control polling at runtime; if polling isn't enabled in the
    block layer, there should be no noticeable change in swap-in behavior.

    I got a chance to run the same test on an NVMe device with DRAM as the
    media. In a simple fio IO test, blkpoll boosts performance by 50% in the
    single-thread test and ~20% in the 8-thread test, so that is the
    baseline. In the swap test above, blkpoll boosts performance by ~27% in
    the single-thread test, though it uses 2x the CPU time.

    If we enable hybrid polling, the performance gain drops very slightly,
    but CPU time is only 50% worse than without blkpoll. We can also tune the
    hybrid polling parameters to reduce the CPU time penalty further. In the
    8-thread test, blkpoll doesn't help: the performance is similar to that
    without blkpoll, and CPU utilization is similar too. There is lock
    contention in the swap path, and the CPU time spent on blkpoll isn't
    high. So overall, blkpoll swap-in isn't worse than without it.

    Swap-in readahead might read several pages in at the same time and form a
    big IO request. Since that IO takes longer, polling doesn't make sense
    there, so the patch only does iopoll for single-page swap-in.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
    Signed-off-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

07 Jul, 2017

3 commits

  • To reduce contention on swap_info_struct->lock when freeing swap entries,
    freed swap entries are first collected in a per-CPU buffer and then
    really freed later in a batch. During the batch free, if consecutive swap
    entries in the per-CPU buffer belong to the same swap device,
    swap_info_struct->lock needs to be acquired/released only once, which
    greatly reduces lock contention. But if there are multiple swap devices,
    the lock may be unnecessarily released/acquired because entries that
    belong to the same swap device may be non-consecutive in the per-CPU
    buffer.

    To solve the issue, the per-CPU buffer is sorted according to the swap
    device before freeing the swap entries.
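
    A hedged userspace sketch of the sort-then-batch idea (the entry layout
    and the lock/free calls are stand-ins, not the kernel's):

    #include <stdio.h>
    #include <stdlib.h>

    struct swp_entry_sketch { unsigned int type; unsigned long offset; };

    static int cmp_by_device(const void *a, const void *b)
    {
            const struct swp_entry_sketch *ea = a, *eb = b;
            return (ea->type > eb->type) - (ea->type < eb->type);
    }

    static void free_entries_batched(struct swp_entry_sketch *ents, size_t n)
    {
            qsort(ents, n, sizeof(*ents), cmp_by_device);   /* group by swap device */
            for (size_t i = 0; i < n; ) {
                    unsigned int dev = ents[i].type;
                    printf("lock device %u\n", dev);        /* sis->lock taken once per run */
                    while (i < n && ents[i].type == dev) {
                            printf("  free offset %lu\n", ents[i].offset);
                            i++;
                    }
                    printf("unlock device %u\n", dev);
            }
    }

    int main(void)
    {
            struct swp_entry_sketch buf[] = {
                    { 0, 10 }, { 1, 7 }, { 0, 11 }, { 1, 8 }, { 0, 12 },
            };
            free_entries_batched(buf, sizeof(buf) / sizeof(buf[0]));
            return 0;
    }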

    With the patch, the time to free memory (some of it swapped out) is
    reduced by 11.6% (from 2.65s to 2.35s) in the vm-scalability swap-w-rand
    test case with 16 processes. The test is done on a Xeon E5 v3 system. The
    swap device used is a RAM-simulated PMEM (persistent memory) device. To
    test swapping, the test case creates 16 processes, which allocate and
    write to anonymous pages until the RAM and part of the swap device are
    used up; finally the memory (some of it swapped out) is freed before
    exit.

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/20170525005916.25249-1-ying.huang@intel.com
    Signed-off-by: Huang Ying
    Acked-by: Tim Chen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Now that get_swap_page() takes a struct page and allocates swap space
    according to the page size (i.e. normal or THP), it is cleaner to
    introduce put_swap_page() as the counterpart of get_swap_page(). It calls
    the right swap slot free function depending on the page's size.

    [ying.huang@intel.com: minor cleanup and fix]
    Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that we
    cannot saturate the disk bandwidth with a single logical CPU when
    swapping out pages, even on a high-end server machine, because storage
    device performance has improved faster than that of a single logical CPU,
    and it seems that this trend will not change in the near future. On the
    other hand, THP is becoming more and more popular because of increased
    memory sizes. So it has become necessary to optimize THP swap
    performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for the swap read, which is usually 4k random
    IO. This will improve the performance of the THP swap too.

    - It will help with memory fragmentation, especially when THP is
    heavily used by applications. The 2M contiguous pages will be freed up
    after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because khugepaged collapses normal pages into a THP quite slowly. If
    the THP is split during swap-out, it takes quite a long time for the
    normal pages to collapse back into a THP after being swapped in. High
    THP utilization also helps the efficiency of page-based memory
    management.

    There are some concerns regarding THP swap-in, mainly because the
    possibly enlarged read/write IO size (for swap in/out) may put more
    overhead on the storage device. To deal with that, THP swap-in should be
    turned on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.

    This patchset is the first step of THP swap support. The plan is to delay
    splitting the THP step by step and finally avoid splitting it during
    swap-out, swapping the THP out and in as a whole.

    As the first step, in this patchset, splitting the huge page is delayed
    from almost the beginning of swap-out to after allocating the swap space
    for the THP and adding it to the swap cache. This reduces lock
    acquiring/releasing for the locks used for swap cache management.

    With the patchset, swap-out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM-simulated PMEM (persistent memory) device. To test
    sequential swap-out, the test case creates 8 processes, which
    sequentially allocate and write to anonymous pages until the RAM and part
    of the swap device are used up.

    This patch (of 5):

    In this patch, splitting the huge page is delayed from almost the
    beginning of swap-out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding it to the swap cache. This batches the
    corresponding operations and thus improves THP swap-out throughput.

    This is the first step of the THP swap optimization. The plan is to delay
    splitting the THP step by step and finally avoid splitting it at all.

    In this patch, one swap cluster is used to hold the contents of each
    swapped-out THP. So the size of the swap cluster is changed to that of
    the THP (Transparent Huge Page) on the x86_64 architecture (512 pages).
    Other architectures that want this THP swap optimization need to select
    ARCH_USES_THP_SWAP_CLUSTER in their Kconfig. In effect, this doubles the
    swap cluster size on x86_64, which may make it harder to find a free
    cluster when the swap space becomes fragmented and may therefore, in
    theory, reduce contiguous swap space allocation and sequential writes.
    The 0day performance tests show no regressions caused by this.

    In future THP swap optimization work, some information about the
    swapped-out THP (such as the compound map count) will be recorded in the
    swap_cluster_info data structure.

    The memcg swap accounting functions are enhanced to support charging or
    uncharging a swap cluster backing a THP as a whole.

    Swap cluster allocate/free functions are added to allocate/free a swap
    cluster for a THP. A fairly simple algorithm is used for swap cluster
    allocation: only the first swap device in the priority list is tried. If
    that fails, the caller falls back to allocating a single swap slot
    instead. This works well enough for normal cases. If the number of free
    swap clusters differs significantly among multiple swap devices, some
    THPs may be split earlier than necessary; for example, this could be
    caused by a big size difference among the swap devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be
    improved in the future with a multi-order radix tree, but because we
    split the THP soon during swap-out anyway, that optimization doesn't make
    much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swap-out. The page lock is held while allocating
    the swap cluster, adding the THP to the swap cache, and splitting the
    THP, so in code paths other than swap-out, if the THP needs to be split,
    PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

09 May, 2017

1 commit

  • Currently vzalloc() is used in the swap code to allocate various data
    structures, such as the swap cache, the swap slots cache, cluster info,
    etc., because the sizes may be too large on some systems for a normal
    kzalloc() to succeed. But using kzalloc() has advantages, for example
    less memory fragmentation and less TLB pressure. So change the data
    structure allocation in the swap code to use kvzalloc(), which tries
    kzalloc() first and falls back to vzalloc() if kzalloc() fails.

    In general, although kmalloc() reduces the number of available high-order
    pages in the short term, vmalloc() causes more pain for memory
    fragmentation in the long term. And the swap data structure allocations
    changed in this patch are expected to be long-term allocations.

    From Dave Hansen:
    "for example, we have a two-page data structure. vmalloc() takes two
    effectively random order-0 pages, probably from two different 2M pages
    and pins them. That "kills" two 2M pages. kmalloc(), allocating two
    *contiguous* pages, will not cross a 2M boundary. That means it will
    only "kill" the possibility of a single 2M page. More 2M pages == less
    fragmentation."

    The allocations in this patch occur at swapon time, which usually happens
    during system boot, so there is usually a good chance of allocating the
    contiguous pages successfully.

    The allocation for swap_map[] in struct swap_info_struct is not changed,
    because that is usually quite large and vmalloc_to_page() is used for
    it. That makes it a little harder to change.
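
    A hedged userspace model of the kvzalloc()/kvfree() pattern described
    above; both branches use calloc() here purely to show the call shape,
    since vzalloc() has no userspace equivalent.

    #include <stdlib.h>

    static void *kzalloc_model(size_t sz) { return calloc(1, sz); }  /* contiguous, may fail */
    static void *vzalloc_model(size_t sz) { return calloc(1, sz); }  /* virtually mapped fallback */

    static void *kvzalloc_model(size_t sz)
    {
            void *p = kzalloc_model(sz);
            return p ? p : vzalloc_model(sz);
    }

    int main(void)
    {
            /* e.g. the cluster_info or swap-slots arrays sized for a large device */
            void *cluster_info = kvzalloc_model(1UL << 20);

            free(cluster_info);     /* kernel callers must pair this with kvfree() */
            return cluster_info == NULL;
    }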

    Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
    Signed-off-by: Huang Ying
    Acked-by: Tim Chen
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

04 May, 2017

4 commits

  • In swapcache_free_entries(), if swap_info_get_cont() returns NULL,
    something is wrong with that swap entry. But we should still continue to
    free the following swap entries in the array instead of skipping them, to
    avoid a swap space leak. This is only a problem in an error path, where
    the system may already be in an inconsistent state, but it is still good
    to fix.

    Link: http://lkml.kernel.org/r/20170421124739.24534-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Tim Chen
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The cluster lock is used to protect swap_cluster_info and the
    corresponding elements in swap_info_struct->swap_map[]. But it turns out
    that in scan_swap_map_slots(), swap_avail_lock may be acquired while the
    cluster lock is held. This does no good except making the locking more
    complex and increasing potential lock contention, because
    swap_info_struct->lock already protects the data structures operated on
    in that code. Fix this by moving the corresponding operations in
    scan_swap_map_slots() out of the cluster lock.

    Link: http://lkml.kernel.org/r/20170317064635.12792-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • This is just a cleanup patch, no functionality change.

    In cluster_list_add_tail(), spin_lock_nested() is used to lock the
    cluster, while unlock_cluster() is used to unlock the cluster. To
    improve the code readability, use spin_unlock() directly to unlock the
    cluster.

    Link: http://lkml.kernel.org/r/20170317064635.12792-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Before the cluster lock was used in free_swap_and_cache(),
    swap_info_struct->lock was held while freeing the swap entry and
    acquiring the page lock, so the page swap count could not change when
    page information was tested later. But with the cluster lock, the cluster
    lock (or swap_info_struct->lock) is held only while freeing the swap
    entry. So before the page lock is acquired, the page swap count may be
    changed by another thread. If the page swap count is not 0, we should not
    delete the page from the swap cache. This is fixed by checking the page
    swap count again after acquiring the page lock.

    I found the race while reviewing the code, so I did not trigger it with a
    test program. If the race occurs for an anonymous page shared by multiple
    processes via fork, multiple pages will be allocated and swapped in from
    the swap device for the previously shared page. That is, the user-visible
    effect is that more memory is used and the access latency for the page is
    higher, i.e. a performance regression.

    Link: http://lkml.kernel.org/r/20170301143905.12846-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

10 Mar, 2017

1 commit


02 Mar, 2017

2 commits

  • We are going to split a new header out of <linux/sched.h>, which will
    have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just maps to <linux/sched.h> to
    make this patch obviously correct and bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

2 commits

  • We already have the helper, we can convert the rest of the kernel
    mechanically using:

    git grep -l 'atomic_inc_not_zero.*mm_users' | xargs sed -i 's/atomic_inc_not_zero(&\(.*\)->mm_users)/mmget_not_zero\(\1\)/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    Link: http://lkml.kernel.org/r/20161218123229.22952-3-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
    git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)

    Link: http://lkml.kernel.org/r/20161218123229.22952-2-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

23 Feb, 2017

8 commits

  • Initialize the swap slots cache and enable it at swapon time. Drain all
    swap slots at swapoff time.

    Link: http://lkml.kernel.org/r/07cbc94882fa95d4ac3cfc50b8dce0b1ec231b93.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Cc: "Huang, Ying"
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • We add per-CPU caches for swap slots so they can be allocated and freed
    quickly without touching the swap info lock.

    Two separate caches are maintained, one for allocated swap slots and one
    for returned swap slots. This allows the returned slots to be given back
    to the global pool in a batch, so they have a chance to be coalesced with
    other slots in a cluster. We do not reuse returned slots right away, as
    that may increase fragmentation of the slots.

    The swap allocation cache is protected by a mutex, as we may sleep while
    searching for empty slots for the cache. The swap free cache is protected
    by a spinlock, as we cannot sleep in the free path.

    We refill the swap slots cache when we run out of slots, and we disable
    the cache and drain its slots if the global number of free slots falls
    below a low watermark. We re-enable the cache when the available slots
    rise above a high watermark.
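
    A hedged userspace sketch of the per-CPU cache layout described above
    (sizes, names, and the mutex standing in for the kernel spinlock are all
    illustrative):

    #include <pthread.h>

    #define SLOTS_CACHE_SIZE 64                 /* illustrative per-CPU batch size */

    struct swap_slots_cache_sketch {
            pthread_mutex_t alloc_lock;         /* may sleep while refilling the cache */
            unsigned long slots[SLOTS_CACHE_SIZE];
            int nr;                             /* slots left to hand out */
            int cur;                            /* next slot to hand out */

            pthread_mutex_t free_lock;          /* the kernel uses a spinlock here */
            unsigned long slots_ret[SLOTS_CACHE_SIZE];
            int n_ret;                          /* returned slots awaiting batch release */
    };

    int main(void)
    {
            struct swap_slots_cache_sketch cache = {
                    .alloc_lock = PTHREAD_MUTEX_INITIALIZER,
                    .free_lock  = PTHREAD_MUTEX_INITIALIZER,
            };
            return cache.nr + cache.n_ret;      /* both caches start empty */
    }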

    [ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access]
    [tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h]
    Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com
    Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Michal Hocko
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • Add new functions that free unused swap slots in batches without the need
    to reacquire the swap info lock. This improves scalability and reduces
    lock contention.

    Link: http://lkml.kernel.org/r/c25e0fcdfd237ec4ca7db91631d3b9f6ed23824e.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • Currently, swap slots are allocated one page at a time, causing
    contention on the swap_info lock protecting the swap partition for every
    page being swapped.

    This patch adds new functions, get_swap_pages() and scan_swap_map_slots(),
    to request multiple swap slots at once. This reduces contention on the
    swap_info lock. scan_swap_map_slots() can also operate more efficiently,
    as swap slots often occur in clusters close to each other on a swap
    device and it is quicker to allocate them together.

    Link: http://lkml.kernel.org/r/9fec2845544371f62c3763d43510045e33d286a6.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • We can avoid needlessly allocating a page for swap slots that are not
    used by anyone; no pages have to be read in for these slots.

    Link: http://lkml.kernel.org/r/0784b3f20b9bd3aa5552219624cb78dc4ae710c9.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • This patch improves the scalability of swap out/in by using fine-grained
    locks for the swap cache. In the current kernel, one address space is
    used for each swap device, and in the common configuration the number of
    swap devices is very small (one is typical). This causes heavy lock
    contention on the radix tree of the address space when multiple tasks
    swap out/in concurrently.

    But in fact, there is no dependency between pages in the swap cache, so
    we can split the one shared address space per swap device into several
    address spaces to reduce lock contention. In this patch, the shared
    address space is split into 64MB chunks. 64MB is chosen to balance memory
    usage against the lock contention reduction.

    The size of struct address_space on the x86_64 architecture is 408B, so
    with the patch, 6528B more memory is used for every 1GB of swap space on
    x86_64.
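
    Illustrative arithmetic only, reproducing the numbers above:

    #include <stdio.h>

    int main(void)
    {
            unsigned long chunk   = 64UL << 20;             /* one address_space per 64MB */
            unsigned long per_gb  = (1UL << 30) / chunk;    /* 16 chunks per 1GB of swap */
            unsigned long as_size = 408;                    /* struct address_space, x86_64 */

            printf("%lu extra bytes per 1GB of swap\n", per_gb * as_size);
            return 0;
    }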

    One address space is still shared by the swap entries in the same 64MB
    chunk. To avoid lock contention during the first round of swap space
    allocation, the order of the swap clusters in the initial free cluster
    list is changed so that the swap space distance between consecutive swap
    clusters in the free cluster list is at least 64MB. After the first round
    of allocation, the swap clusters are expected to be freed randomly, so
    lock contention should be reduced effectively.

    Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Tim Chen
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang, Ying
     
  • This patch reduces contention on swap_info_struct->lock by using a
    finer-grained lock in swap_cluster_info for some swap operations.
    swap_info_struct->lock is heavily contended when multiple processes
    reclaim pages simultaneously, because there is only one lock per swap
    device, a common configuration has only one or a few swap devices, and
    the lock protects almost all swap-related operations.

    In fact, many swap operations only access one element of the
    swap_info_struct->swap_map array, and there is no dependency between
    different elements of swap_info_struct->swap_map. So a fine-grained lock
    can be used to allow parallel access to different elements of
    swap_info_struct->swap_map.

    In this patch, a spinlock is added to swap_cluster_info to protect the
    elements of swap_info_struct->swap_map in the swap cluster and the fields
    of swap_cluster_info. This greatly reduces lock contention for
    swap_info_struct->swap_map access.

    Because of the added spinlock, the size of swap_cluster_info increases
    from 4 bytes to 8 bytes on both 64-bit and 32-bit systems. This uses an
    additional 4k of RAM for every 1G of swap space.
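
    Illustrative arithmetic only, assuming the 256-page swap cluster size in
    use at the time:

    #include <stdio.h>

    int main(void)
    {
            unsigned long pages    = (1UL << 30) / 4096;   /* pages in 1GB of swap */
            unsigned long clusters = pages / 256;          /* 256 pages per cluster */
            unsigned long extra    = clusters * 4;         /* 4 more bytes per cluster */

            printf("%lu extra bytes per 1GB of swap\n", extra);   /* 4096 == 4k */
            return 0;
    }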

    Because swap_cluster_info is much smaller than a cache line (8 vs 64
    bytes on x86_64), there may be false cache line sharing between the
    spinlocks in swap_cluster_info. To avoid this false sharing during the
    first round of swap cluster allocation, the order of the swap clusters in
    the free clusters list is changed so that swap_cluster_info entries
    sharing the same cache line are placed as far apart as possible. After
    the first round of allocation, the order of the clusters in the free
    clusters list is expected to be random, so the false sharing should not
    be serious.

    Compared with a previous implementation using bit_spin_lock, the
    sequential swap-out throughput improved by about 3.2%. The test was done
    on a Xeon E5 v3 system. The swap device used is a RAM-simulated PMEM
    (persistent memory) device. To test sequential swap-out, the test case
    created 32 processes, which sequentially allocate and write to anonymous
    pages until the RAM and part of the swap device are used.

    [ying.huang@intel.com: v5]
    Link: http://lkml.kernel.org/r/878tqeuuic.fsf_-_@yhuang-dev.intel.com
    [minchan@kernel.org: initialize spinlock for swap_cluster_info]
    Link: http://lkml.kernel.org/r/1486434945-29753-1-git-send-email-minchan@kernel.org
    [hughd@google.com: annotate nested locking for cluster lock]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702161050540.21773@eggly.anvils
    Link: http://lkml.kernel.org/r/dbb860bbd825b1aaba18988015e8963f263c3f0d.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Tim Chen
    Signed-off-by: Minchan Kim
    Signed-off-by: Hugh Dickins
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang, Ying
     
  • Patch series "mm/swap: Regular page swap optimizations", v5.

    Times have changed. The coming generation of solid-state block devices
    has latencies down to sub-100 usec, within an order of magnitude of DRAM,
    and their performance is orders of magnitude higher than the
    single-spindle rotational media we've swapped to historically.

    This could benefit many usage scenarios, for example cloud providers who
    overcommit their memory (as VMs don't use all the memory provisioned).
    Fast swap will allow them to be more aggressive in memory overcommit and
    fit more VMs onto a platform.

    In our testing [see footnote], the median latency that the kernel adds
    to a page fault is 15 usec, which comes quite close to the amount that
    will be contributed by the underlying I/O devices.

    The software latency comes mostly from contention on the locks protecting
    the radix tree of the swap cache and on the locks protecting the
    individual swap devices. Lock contention already consumed 35% of CPU
    cycles in our test. In the very near future, software latency will become
    the bottleneck for swap performance as block device I/O latency gets
    within shouting distance of DRAM speed.

    This patch set reduced the median page fault latency from 15 usec to 4
    usec (a 3.75x reduction) for a DRAM-based pmem block device.

    This patch (of 9):

    swap_info_get() is used not only in the swap free code path but also in
    page_swapcount(), etc., so the original kernel message in swap_info_get()
    is no longer correct. Fix it by replacing "swap_free" with
    "swap_info_get" in the message.

    Link: http://lkml.kernel.org/r/9b5f8bd6266f9da978c373f2384c8044df5e262c.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Tim Chen
    Reviewed-by: Rik van Riel
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang, Ying
     

11 Jan, 2017

1 commit

  • During development of zram-swap asynchronous writeback, I found strange
    corruption of a compressed page, resulting in:

    Modules linked in: zram(E)
    CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff88007620b840 task.stack: ffff880078090000
    RIP: set_freeobj.part.43+0x1c/0x1f
    RSP: 0018:ffff880078093ca8 EFLAGS: 00010246
    RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
    RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
    RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
    R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
    FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
    Call Trace:
    obj_malloc+0x22b/0x260
    zs_malloc+0x1e4/0x580
    zram_bvec_rw+0x4cd/0x830 [zram]
    page_requests_rw+0x9c/0x130 [zram]
    zram_thread+0xe6/0x173 [zram]
    kthread+0xca/0xe0
    ret_from_fork+0x25/0x30

    Investigation revealed that stable pages currently don't cover anonymous
    pages. IOW, reuse_swap_page() can reuse the page without waiting for
    writeback completion, so it can overwrite a page zram is compressing.

    Unfortunately, zram has used the per-CPU stream feature since v4.7. It
    aims to increase the cache hit ratio of the scratch buffer used for
    compression. The downside of that approach is that zram must ask for
    memory space for the compressed page in per-CPU context, which requires a
    restricted gfp flag and can fail. If it does, zram retries the allocation
    outside of per-CPU context, where it may get memory, compresses the data
    again and copies it to that memory.

    In this scenario, zram assumes the data will never change, but that is
    not true without stable page support. If the data changes under us, zram
    can overrun a buffer, because the second compression's size could be
    bigger than the one we got in the previous attempt, and blindly copying
    the bigger object into the smaller buffer is a buffer overrun. The
    overrun breaks zsmalloc's free object chaining, so the system crashes as
    shown above.

    I think the report below is the same problem:
    https://bugzilla.suse.com/show_bug.cgi?id=997574

    Unfortunately, reuse_swap_page() must be atomic, so we cannot wait on
    writeback there. The approach in this patch is to simply return false if
    we find the page needs stable-page treatment. Although this temporarily
    increases the memory footprint, it happens rarely and the extra memory is
    easily reclaimed when it does. It is also better than waiting for IO
    completion, which is on the critical path for application latency.

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
    Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Darrick J. Wong
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

13 Dec, 2016

1 commit

  • Add a cond_resched() in the unuse_pmd_range() loop (so that it is called
    even when the pmd is none or trans_huge, like zap_pmd_range() does), and
    in the unuse_mm() loop (since that might skip over many vmas).
    shmem_unuse() and radix_tree_locate_item() look good enough already.

    Those were the obvious places, but in fact the stalls came from
    find_next_to_unuse(), which sometimes scans through many unused entries.
    Apply scan_swap_map()'s LATENCY_LIMIT of 256 there too; and only go off
    to test frontswap_map when a used entry is found.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052155140.13021@eggly.anvils
    Signed-off-by: Hugh Dickins
    Reported-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Nov, 2016

1 commit

  • When root activates a swap partition whose header has the wrong
    endianness, nr_badpages elements of badpages are swabbed before
    nr_badpages has been checked, leading to a buffer overrun of up to 8GB.

    This normally is not a security issue because it can only be exploited
    by root (more specifically, a process with CAP_SYS_ADMIN or the ability
    to modify a swap file/partition), and such a process can already e.g.
    modify swapped-out memory of any other userspace process on the system.
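
    A hedged userspace sketch of the ordering fix described above (the bound
    and struct are illustrative, not the kernel's MAX_SWAP_BADPAGES formula):

    #include <stdio.h>
    #include <stdint.h>
    #include <byteswap.h>

    #define BADPAGES_MAX_SKETCH 8    /* illustrative bound */

    struct swap_header_info_sketch {
            uint32_t version, last_page, nr_badpages;
            uint32_t badpages[BADPAGES_MAX_SKETCH];
    };

    /* Swap and bound nr_badpages *before* walking badpages[], so a hostile
     * header cannot drive the loop far past the end of the array. */
    static int swab_swap_header_sketch(struct swap_header_info_sketch *info)
    {
            info->version     = bswap_32(info->version);
            info->last_page   = bswap_32(info->last_page);
            info->nr_badpages = bswap_32(info->nr_badpages);
            if (info->nr_badpages > BADPAGES_MAX_SKETCH)
                    return -1;                               /* reject early */
            for (uint32_t i = 0; i < info->nr_badpages; i++)
                    info->badpages[i] = bswap_32(info->badpages[i]);
            return 0;
    }

    int main(void)
    {
            struct swap_header_info_sketch evil = { .nr_badpages = bswap_32(1u << 30) };

            printf("%s\n", swab_swap_header_sketch(&evil) ? "rejected" : "accepted");
            return 0;
    }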

    Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Acked-by: Jerome Marchand
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn