06 Apr, 2019

1 commit

  • [ Upstream commit c10d38cc8d3e43f946b6c2bf4602c86791587f30 ]

    Dan Carpenter reports a potential NULL dereference in
    get_swap_page_of_type:

    Smatch complains that the NULL checks on "si" aren't consistent. This
    seems like a real bug because we have not ensured that the type is
    valid and so "si" can be NULL.

    Add the missing check for NULL, taking care to use a read barrier to
    ensure CPU1 observes CPU0's updates in the correct order:

    CPU0                          CPU1
    alloc_swap_info()             if (type >= nr_swapfiles)
      swap_info[type] = p             /* handle invalid entry */
      smp_wmb()                       smp_rmb()
      ++nr_swapfiles                  p = swap_info[type]

    Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
    CPU0's write to swap_info[type] and read NULL from swap_info[type].

    Ying Huang noticed other places in swapfile.c don't order these reads
    properly. Introduce swap_type_to_swap_info to encourage correct usage.

    Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
    (see tools/memory-model/Documentation/explanation.txt).

    This ordering need not be enforced in places where swap_lock is held
    (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
    and the swap_info array.
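
    A minimal sketch of such a helper, following the ordering shown above
    (the exact upstream implementation may differ):

        static struct swap_info_struct *swap_type_to_swap_info(int type)
        {
                if (type >= READ_ONCE(nr_swapfiles))
                        return NULL;

                smp_rmb();      /* pairs with smp_wmb() in alloc_swap_info() */
                return READ_ONCE(swap_info[type]);
        }

    Callers such as get_swap_page_of_type() then check the returned pointer
    for NULL before dereferencing it.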

    Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
    Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
    Signed-off-by: Daniel Jordan
    Reported-by: Dan Carpenter
    Suggested-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Acked-by: Peter Zijlstra (Intel)
    Cc: Alan Stern
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Omar Sandoval
    Cc: Paul McKenney
    Cc: Shaohua Li
    Cc: Stephen Rothwell
    Cc: Tejun Heo
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Daniel Jordan
     

26 Jan, 2019

1 commit

  • [ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ]

    Since a2468cc9bfdf ("swap: choose swap device according to numa node"),
    the avail_lists field of swap_info_struct has been an array with
    MAX_NUMNODES elements. This increased the size of swap_info_struct to
    40KiB, which needs an order-4 page to hold it.

    This is not optimal in that:
    1. Most systems have far fewer than MAX_NUMNODES (1024) nodes, so the
       extra entries waste memory;
    2. It could cause swapon failure if the swap device is swapped on after
       the system has been running for a while, because no order-4 page is
       available, as pointed out by Vasily Averin.

    Solve the above two issues by using nr_node_ids (the actual number of
    possible nodes on the running system) for avail_lists instead of
    MAX_NUMNODES.

    nr_node_ids is unknown at compile time, so it can't be used directly
    when declaring this array. What I did here is declare avail_lists as a
    zero-length array and allocate space for it when allocating space for
    swap_info_struct. The reason for keeping an array rather than a pointer
    is that plist_for_each_entry needs the field to be part of the struct,
    so a pointer will not work.

    This patch is on top of Vasily Averin's fix commit. I think the use of
    kvzalloc for swap_info_struct is still needed in case nr_node_ids is
    really big on some systems.
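
    A minimal sketch of the layout and allocation described above
    (simplified; the real structure has many more fields and the allocation
    details may differ):

        struct swap_info_struct {
                /* ... other fields ... */
                struct plist_node avail_lists[];  /* zero-length in the patch, sized at run time */
        };

        static struct swap_info_struct *alloc_swap_info(void)
        {
                struct swap_info_struct *p;
                unsigned int size = sizeof(*p)
                        + nr_node_ids * sizeof(p->avail_lists[0]);

                /* kvzalloc() in case nr_node_ids is really big */
                p = kvzalloc(size, GFP_KERNEL);
                if (!p)
                        return ERR_PTR(-ENOMEM);
                /* ... */
                return p;
        }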

    Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
    Signed-off-by: Aaron Lu
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vasily Averin
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Aaron Lu
     

13 Jan, 2019

1 commit

  • commit 7af7a8e19f0c5425ff639b0f0d2d244c2a647724 upstream.

    KSM pages may be mapped to multiple VMAs that cannot be reached from
    one anon_vma. So during swapin, a new copy of the page needs to be
    generated if a different anon_vma is needed; please refer to the
    comments of ksm_might_need_to_copy() for details.

    During swapoff, unuse_vma() uses the anon_vma (if available) to locate
    the VMA and virtual address mapped to the page, so not all mappings of
    a swapped-out KSM page can be found. So in try_to_unuse(), even if the
    swap count of a swap entry isn't zero, the page needs to be deleted from
    the swap cache, so that in the next round a new page can be allocated
    and swapped in for the other mappings of the swapped-out KSM page.

    But this conflicts with THP swap support, where a THP can be deleted
    from the swap cache only after the swap count of every swap entry in the
    huge swap cluster backing the THP has reached 0. So try_to_unuse() was
    changed in commit e07098294adf ("mm, THP, swap: support to reclaim swap
    space for THP swapped out") to check that before deleting a page from
    the swap cache, but this broke KSM swapoff too.

    Fortunately, KSM is for normal pages only, so the original behavior for
    KSM pages can be restored easily by checking PageTransCompound(). That
    is how this patch works.

    The bug was introduced by e07098294adf ("mm, THP, swap: support to
    reclaim swap space for THP swapped out"), which was merged in v4.14-rc1.
    So I think we should backport the fix to 4.14 onwards. But Hugh thinks
    it may be rare for KSM pages to be in the swap device at swapoff time,
    which is why nobody has reported the bug so far.
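
    A minimal sketch of the restored check in try_to_unuse(), as described
    above (simplified; the exact upstream condition may differ):

        /*
         * A KSM page is never a compound page, so keep the old behavior
         * for it: drop the page from the swap cache even when the swap
         * count isn't zero, so the other mappings can be swapped in from
         * a fresh copy in the next round.
         */
        if (PageSwapCache(page) &&
            likely(page_private(page) == entry.val) &&
            (!PageTransCompound(page) ||
             !swap_page_trans_huge_swapped(si, entry)))
                delete_from_swap_cache(compound_head(page));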

    Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
    Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
    Signed-off-by: "Huang, Ying"
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Daniel Jordan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     

21 Nov, 2018

1 commit

  • commit 873d7bcfd066663e3e50113dc4a0de19289b6354 upstream.

    Commit a2468cc9bfdf ("swap: choose swap device according to numa node")
    changed the 'avail_lists' field of 'struct swap_info_struct' to an
    array. In popular Linux distros this increased the size of
    swap_info_struct to about 40 Kbytes, so a swap_info_struct allocation
    now requires an order-4 page. Switching to kvzalloc() avoids such
    unexpected allocation failures.
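
    Illustratively, the change amounts to swapping the allocator and its
    matching free (a sketch, not the literal diff):

        struct swap_info_struct *p;

        /* Was kzalloc(); kvzalloc() falls back to vmalloc() when an
         * order-4 physically contiguous allocation cannot be satisfied. */
        p = kvzalloc(sizeof(*p), GFP_KERNEL);
        if (!p)
                return ERR_PTR(-ENOMEM);
        /* ... */
        kvfree(p);      /* matching free on the cleanup paths */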

    Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
    Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node")
    Signed-off-by: Vasily Averin
    Acked-by: Aaron Lu
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Huang Ying
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     

23 Aug, 2018

8 commits

    In this patch, the locking related code is shared between the
    huge/normal code paths in put_swap_page() to reduce code duplication.
    The `free_entries == 0` case is merged into the more general
    `free_entries != SWAPFILE_CLUSTER` case, because the new locking method
    makes it easy.

    The number of added lines is the same as the number of removed lines,
    but the code size increases when CONFIG_TRANSPARENT_HUGEPAGE=n.

                 text    data    bss      dec     hex  filename
    base:       24123    2004    340    26467    6763  mm/swapfile.o
    unified:    24485    2004    340    26829    68cd  mm/swapfile.o

    Digging one step deeper with `size -A mm/swapfile.o` for the base and
    unified kernels and comparing the results yields:

    -.text              17723      0
    +.text              17835      0
    -.orc_unwind_ip      1380      0
    +.orc_unwind_ip      1480      0
    -.orc_unwind         2070      0
    +.orc_unwind         2220      0
    -Total              26686
    +Total              27048

    The total difference is the same (362 bytes), but the text segment
    accounts for only 112 of it; most of the difference comes from the ORC
    unwinder sections: (1480 + 2220) - (1380 + 2070) = 250. If the frame
    pointer unwinder is used, this costs nothing.

    Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    The part of __swap_entry_free() that runs with the lock held is
    separated into a new function, __swap_entry_free_locked(), because we
    want to reuse that piece of code in some other places.

    This is just mechanical code refactoring; there is no functional change
    in this function.
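
    A minimal sketch of the factoring (simplified; the body of the extracted
    __swap_entry_free_locked() is elided):

        static unsigned char __swap_entry_free(struct swap_info_struct *p,
                                               swp_entry_t entry,
                                               unsigned char usage)
        {
                struct swap_cluster_info *ci;
                unsigned long offset = swp_offset(entry);

                ci = lock_cluster_or_swap_info(p, offset);
                /* the lock-held part, now reusable by other lock holders */
                usage = __swap_entry_free_locked(p, offset, usage);
                unlock_cluster_or_swap_info(p, ci);

                return usage;
        }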

    Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    As suggested by Matthew Wilcox, it is better to use "int entry_size"
    instead of "bool cluster" as the parameter that specifies whether to
    operate on huge or normal swap entries, because this improves the
    flexibility to support other swap entry sizes. And Dave Hansen thinks
    that this improves code readability too.

    So in this patch, the "bool cluster" parameter of get_swap_pages() is
    replaced by "int entry_size".

    And the nr_swap_entries() trick is used to reduce the binary size when
    !CONFIG_TRANSPARENT_HUGEPAGE.

             text    data    bss      dec     hex  filename
    base    24215    2028    340    26583    67d7  mm/swapfile.o
    head    24123    2004    340    26467    6763  mm/swapfile.o
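
    Roughly, the prototype change being described (a hedged sketch; the
    exact parameter order in the tree may differ):

        /* Before: a boolean can only distinguish normal vs. THP entries. */
        int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[]);

        /* After: the caller passes the number of 4k entries per unit,
         * e.g. 1 for a normal page or SWAPFILE_CLUSTER for a THP. */
        int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size);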

    Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Matthew Wilcox
    Acked-by: Dave Hansen
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    In this patch, the normal/huge code paths in put_swap_page() and
    several helper functions are unified to avoid duplicated code, bugs,
    etc., and to make the code easier to review.

    There are more removed lines than added lines, and the binary size is
    kept exactly the same when CONFIG_TRANSPARENT_HUGEPAGE=n.

    Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Dave Hansen
    Acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    As suggested by Dave, we should unify the code paths for normal and
    huge swap support if possible, to avoid duplicated code, bugs, etc. and
    make the code easier to review.

    In this patch, the normal/huge code path in
    swap_page_trans_huge_swapped() is unified; the number of added and
    removed lines is the same. And the binary size is kept almost the same
    when CONFIG_TRANSPARENT_HUGEPAGE=n.

                 text    data    bss      dec     hex  filename
    base:       24179    2028    340    26547    67b3  mm/swapfile.o
    unified:    24215    2028    340    26583    67d7  mm/swapfile.o

    Link: http://lkml.kernel.org/r/20180720071845.17920-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-and-acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    In swap_page_trans_huge_swapped(), "si->swap_map[i] != SWAP_HAS_CACHE"
    is used to identify whether there is any page table mapping for a
    4k-sized swap entry. This works correctly now, because all users of the
    function only call it after checking SWAP_HAS_CACHE. But as pointed out
    by Daniel, it is better to use "swap_count(map[i])" here, because it
    works for the "map[i] == 0" case too.

    And this makes the implementation more consistent between normal and
    huge swap entries.
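
    For reference, swap_count() essentially masks off the cache bit, so it
    yields zero both for a free slot and for a cache-only slot (sketch):

        static inline unsigned char swap_count(unsigned char ent)
        {
                return ent & ~SWAP_HAS_CACHE;   /* may include COUNT_CONTINUED */
        }

    With this, the old "map[i] != SWAP_HAS_CACHE" test, which wrongly treats
    a free slot (map[i] == 0) as mapped, becomes "swap_count(map[i])".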

    Link: http://lkml.kernel.org/r/20180720071845.17920-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-and-reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    In mm/swapfile.c, THP (Transparent Huge Page) swap specific code is
    enclosed in #ifdef CONFIG_THP_SWAP/#endif to avoid code bloat when THP
    isn't enabled. But #ifdef/#endif in a .c file hurts code readability, so
    Dave suggested using IS_ENABLED(CONFIG_THP_SWAP) instead and letting the
    compiler do the dirty job for us. This also has the potential to remove
    some duplicated code. From the output of `size`:

                     text    data    bss      dec     hex  filename
    THP=y:          26269    2076    340    28685    700d  mm/swapfile.o
    ifdef/endif:    24115    2028    340    26483    6773  mm/swapfile.o
    IS_ENABLED:     24179    2028    340    26547    67b3  mm/swapfile.o

    The IS_ENABLED() based solution works quite well, almost as well as
    #ifdef/#endif. And from the diffstat, there are more removed lines than
    added lines.

    One #ifdef for split_swap_cluster() is kept, because it is a public
    function with a stub implementation for CONFIG_THP_SWAP=n in swap.h.
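
    The pattern looks roughly like this (the guarded statement is just an
    illustration):

        /* Old style: the block vanishes entirely when CONFIG_THP_SWAP=n. */
        #ifdef CONFIG_THP_SWAP
                cluster_clear_huge(ci);
        #endif

        /* New style: always compiled (so it always gets build coverage),
         * but the dead branch is eliminated when the option is off. */
        if (IS_ENABLED(CONFIG_THP_SWAP))
                cluster_clear_huge(ci);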

    Link: http://lkml.kernel.org/r/20180720071845.17920-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-and-acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "swap: THP optimizing refactoring", v4.

    Currently, the THP (Transparent Huge Page) swap optimization is
    implemented in the following way:

    #ifdef CONFIG_THP_SWAP
    huge_function(...)
    {
    }
    #else
    normal_function(...)
    {
    }
    #endif

    general_function(...)
    {
            if (huge)
                    return thp_function(...);
            else
                    return normal_function(...);
    }

    As pointed out by Dave Hansen, this will:
    1. Create a new, wholly untested code path for huge pages
    2. Create two places to patch bugs
    3. Not reuse code when possible

    This patchset addresses these problems by merging the huge/normal code
    paths and functions where possible.

    One concern is that this may cause the code size to grow when
    !CONFIG_TRANSPARENT_HUGEPAGE. The data show that most of the refactoring
    causes only a slight code size increase.

    This patch (of 8):

    To improve code readability.

    Link: http://lkml.kernel.org/r/20180720071845.17920-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-and-acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

09 Jul, 2018

1 commit

  • Memory allocations can induce swapping via kswapd or direct reclaim. If
    we are having IO done for us by kswapd and don't actually go into direct
    reclaim we may never get scheduled for throttling. So instead check to
    see if our cgroup is congested, and if so schedule the throttling.
    Before we return to user space the throttling stuff will only throttle
    if we actually required it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Jun, 2018

1 commit

    For the L1TF workaround it's necessary to limit the swap file size to
    below MAX_PA/2, so that the inverted higher bits of the swap offset
    never point to valid memory.

    Add a mechanism for the architecture to override the swap file size
    check in swapfile.c, and add an x86-specific max swapfile check function
    that enforces that limit.

    The check is only enabled if the CPU is vulnerable to L1TF.

    In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
    with 46bit PA it is 32TB. The limit is only per individual swap file, so
    it's always possible to exceed these limits with multiple swap files or
    partitions.
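
    A sketch of the override mechanism, as I understand it from the
    description above (function and helper names are a best guess and may
    differ from the actual patches):

        /* mm/swapfile.c: default limit, overridable by the architecture. */
        __weak unsigned long max_swapfile_size(void)
        {
                return generic_max_swapfile_size();
        }

        /* arch/x86 (sketch): tighten the limit only on affected CPUs. */
        unsigned long max_swapfile_size(void)
        {
                unsigned long pages = generic_max_swapfile_size();

                if (boot_cpu_has_bug(X86_BUG_L1TF)) {
                        /* Keep swap offsets below MAX_PA/2 so the inverted
                         * bits can never point at valid memory. */
                        pages = min_t(unsigned long, l1tf_pfn_limit(), pages);
                }
                return pages;
        }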

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen

    Andi Kleen
     

15 Jun, 2018

1 commit

  • Commit 570a335b8e22 ("swap_info: swap count continuations") introduces
    COUNT_CONTINUED but refers to it incorrectly as SWAP_HAS_CONT in a
    comment in swap_count. Fix it.

    Link: http://lkml.kernel.org/r/20180612175919.30413-1-daniel.m.jordan@oracle.com
    Fixes: 570a335b8e22 ("swap_info: swap count continuations")
    Signed-off-by: Daniel Jordan
    Reviewed-by: Andrew Morton
    Cc: "Huang, Ying"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     

13 Jun, 2018

1 commit

  • The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
    patch replaces cases of:

    kvzalloc(a * b, gfp)

    with:
    kvcalloc(a, b, gfp)

    as well as handling cases of:

    kvzalloc(a * b * c, gfp)

    with:

    kvzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kvcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kvzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.
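
    For illustration, a typical 2-factor conversion (the variable names here
    are made up, not taken from any particular call site):

        /* Before: open-coded multiplication, which can overflow silently. */
        map = kvzalloc(nr_entries * sizeof(*map), GFP_KERNEL);

        /* After: kvcalloc() takes count and element size separately and
         * returns NULL on multiplication overflow. */
        map = kvcalloc(nr_entries, sizeof(*map), GFP_KERNEL);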

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kvzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kvzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kvzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kvzalloc
    + kvcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kvzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kvzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(sizeof(THING) * C2, ...)
    |
    kvzalloc(sizeof(TYPE) * C2, ...)
    |
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(C1 * C2, ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

26 May, 2018

1 commit

  • If swapon() fails after incrementing nr_rotate_swap, we don't decrement
    it and thus effectively leak it. Make sure we decrement it if we
    incremented it.
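
    A minimal sketch of the fix, remembering whether the counter was bumped
    so the error path can undo it (the flag name is illustrative, and the
    rotational-device condition is simplified):

        bool inced_nr_rotate_swap = false;

        /* ... */
        if (p->bdev && !blk_queue_nonrot(bdev_get_queue(p->bdev))) {
                atomic_inc(&nr_rotate_swap);
                inced_nr_rotate_swap = true;
        }
        /* ... */
    bad_swap:
        if (inced_nr_rotate_swap)
                atomic_dec(&nr_rotate_swap);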

    Link: http://lkml.kernel.org/r/b6fe6b879f17fa68eee6cbd876f459f6e5e33495.1526491581.git.osandov@fb.com
    Fixes: 81a0298bdfab ("mm, swap: don't use VMA based swap readahead if HDD is used as swap")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Rik van Riel
    Reviewed-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     

12 Apr, 2018

2 commits

  • The pointer swap_avail_heads is local to the source and does not need to
    be in global scope, so make it static.

    Cleans up sparse warning:

    mm/swapfile.c:88:19: warning: symbol 'swap_avail_heads' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180206215836.12366-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • Calling swapon() on a zero length swap file on SSD can lead to a
    divide-by-zero.

    Although creating such files isn't possible with mkswap and they would
    be considered invalid, it would be better for the swapon code to be more
    robust and handle this condition gracefully (return -EINVAL), especially
    since the fix is small and straightforward.

    To help with wear leveling on SSD, the swapon syscall calculates a
    random position in the swap file using modulo p->highest_bit, which is
    set to maxpages - 1 in read_swap_header.

    If the swap file is zero length, read_swap_header sets maxpages=1 and
    last_page=0, resulting in p->highest_bit=0, and we divide by zero when
    we take the modulo of p->highest_bit in the swapon syscall.

    This can be prevented by having read_swap_header return zero if
    last_page is zero.
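
    A minimal sketch of the check in read_swap_header() (the exact warning
    text may differ):

        unsigned long last_page = swap_header->info.last_page;

        if (!last_page) {
                pr_warn("Empty swap-file\n");
                return 0;       /* caller then fails swapon with -EINVAL */
        }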

    Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
    Signed-off-by: Thomas Abraham
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Abraham
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\\<POLL$V\\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Nov, 2017

1 commit


16 Nov, 2017

3 commits

    When SWP_SYNCHRONOUS_IO swapped-in pages are shared by several
    processes, skipping the swap cache can cause unnecessary memory wastage,
    because on a read swapin fault the processes could share a page if the
    page were in the swap cache, which avoids allocating new pages with the
    same content.

    This patch makes the swapcache skipping work only if the swap pte is
    non-shareable.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1507620825-5537-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Sergey Senozhatsky
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    With fast swap storage, platforms want to use swap more aggressively,
    and swap-in is crucial to application latency.

    The rw_page() based synchronous devices like zram, pmem and btt are such
    fast storage. When I profile swapin performance with a zram lz4
    decompression test, the software overhead is more than 70%. It may be
    even bigger on nvdimm.

    This patch aims to reduce swap-in latency by skipping the swapcache if
    the swap device is a synchronous device like an rw_page based device.
    It improves my swapin test (5G sequential swapin, no readahead) by 45%,
    from 2.41sec to 1.64sec.

    Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Sergey Senozhatsky
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    If rw_page based fast storage is used for swap devices, we need to
    detect it to enhance swap IO operations. This patch is preparation for
    optimizing the swap-in operation in the next patch.

    Link: http://lkml.kernel.org/r/1505886205-9671-4-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Ross Zwisler
    Cc: Sergey Senozhatsky
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

03 Nov, 2017

1 commit

    One page may store a set of entries of the sis->swap_map
    (swap_info_struct->swap_map) in multiple swap clusters.

    If some of the entries have sis->swap_map[offset] > SWAP_MAP_MAX,
    multiple pages will be used to store the set of entries of the
    sis->swap_map, and the pages are linked with page->lru. This is called
    swap count continuation. To access the pages which store the set of
    entries of the sis->swap_map simultaneously, sis->lock was previously
    used. But to improve the scalability of __swap_duplicate(), the swap
    cluster lock may be used in swap_count_continued() now. This may race
    with add_swap_count_continuation(), which operates on a nearby swap
    cluster whose sis->swap_map entries are stored in the same page.

    The race can cause a wrong swap count in practice, and thus unfreeable
    swap entries, software lockups, etc.

    To fix the race, a new spinlock called cont_lock is added to struct
    swap_info_struct to protect the swap count continuation page list. This
    is a lock at the swap device level, so the scalability isn't great, but
    it is still much better than the original sis->lock, because it is only
    acquired/released when swap count continuation is used, which is
    considered rare in practice. If the scalability turns out to be an issue
    for some workloads, we can split the lock into more fine-grained locks.
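
    A minimal sketch of the locking described (heavily simplified; the real
    add_swap_count_continuation() and swap_count_continued() do much more
    under the lock):

        struct swap_info_struct {
                /* ... */
                spinlock_t cont_lock;   /* protects the continuation pages */
        };

        int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
        {
                /* ... look up si and the head page for this swap_map page ... */
                spin_lock(&si->cont_lock);
                /* walk/extend the continuation pages linked via page->lru */
                spin_unlock(&si->cont_lock);
                /* ... */
                return 0;
        }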

    Link: http://lkml.kernel.org/r/20171017081320.28133-1-ying.huang@intel.com
    Fixes: 235b62176712 ("mm/swap: add cluster lock")
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Aaron Lu
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: [4.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

09 Sep, 2017

2 commits

  • Free frontswap_map if an error is encountered before enable_swap_info().

    Signed-off-by: David Rientjes
    Reviewed-by: "Huang, Ying"
    Cc: Darrick J. Wong
    Cc: Hugh Dickins
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If initializing a small swap file fails because the swap file has a
    problem (holes, etc.) then we need to free the cluster info as part of
    cleanup. Unfortunately a previous patch changed the code to use kvzalloc
    but did not change all the vfree calls to use kvfree.

    Found by running generic/357 from xfstests.
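
    The rule being applied, shown generically (the affected frees are in the
    swapon cleanup path):

        /* Allocated with kvzalloc(), which may hand back either kmalloc'd
         * or vmalloc'd memory ... */
        cluster_info = kvzalloc(nr_cluster * sizeof(*cluster_info), GFP_KERNEL);

        /* ... so cleanup must use kvfree(), not vfree() or kfree(). */
        kvfree(cluster_info);
        kvfree(frontswap_map);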

    Link: http://lkml.kernel.org/r/20170831233515.GR3775@magnolia
    Fixes: 54f180d3c181 ("mm, swap: use kvzalloc to allocate some swap data structures")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: "Huang, Ying"
    Acked-by: David Rientjes
    Cc: Hugh Dickins
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

07 Sep, 2017

7 commits

    If the system has more than one swap device and the swap devices have
    node information, we can make use of this information in
    get_swap_pages() to decide which swap device to use and get better
    performance.

    The current code uses a priority based list, swap_avail_list, to decide
    which swap device to use and if multiple swap devices share the same
    priority, they are used round robin. This patch changes the previous
    single global swap_avail_list into a per-numa-node list, i.e. for each
    numa node, it sees its own priority based list of available swap
    devices. Swap device's priority can be promoted on its matching node's
    swap_avail_list.

    A swap device's priority is currently set as follows: the user can set
    a value >= 0, or the system will pick one starting from -1 and counting
    downwards. The priority value in the swap_avail_list is the negated
    value of the swap device's priority, because plist is sorted from low to
    high. The new policy doesn't change the semantics for the priority >= 0
    cases; the previous "starting from -1 then downwards" now becomes
    "starting from -2 then downwards", and -1 is reserved as the promoted
    value.

    Take a 4-node EX machine as an example; suppose 4 swap devices are
    available, each sitting on a different node:
    swapA on node 0
    swapB on node 1
    swapC on node 2
    swapD on node 3

    After they are all swapped on in the sequence of ABCD.

    Current behaviour:
    their priorities will be:
    swapA: -1
    swapB: -2
    swapC: -3
    swapD: -4
    And their position in the global swap_avail_list will be:
    swapA   -> swapB   -> swapC   -> swapD
    prio:1     prio:2     prio:3     prio:4

    New behaviour:
    their priorities will be (note that -1 is skipped):
    swapA: -2
    swapB: -3
    swapC: -4
    swapD: -5
    And their positions in the 4 swap_avail_lists[nid] will be:
    swap_avail_lists[0]: /* node 0's available swap device list */
    swapA   -> swapB   -> swapC   -> swapD
    prio:1     prio:3     prio:4     prio:5
    swap_avail_lists[1]: /* node 1's available swap device list */
    swapB   -> swapA   -> swapC   -> swapD
    prio:1     prio:2     prio:4     prio:5
    swap_avail_lists[2]: /* node 2's available swap device list */
    swapC   -> swapA   -> swapB   -> swapD
    prio:1     prio:2     prio:3     prio:5
    swap_avail_lists[3]: /* node 3's available swap device list */
    swapD   -> swapA   -> swapB   -> swapC
    prio:1     prio:2     prio:3     prio:4

    To see the effect of the patch, a test is used that starts N processes,
    each of which mmaps a region of anonymous memory and then continually
    writes to it at random positions to trigger both swap-in and swap-out.

    On a 2-node Skylake EP machine with 64GiB memory, two 170GB SSD drives
    are used as swap devices, each attached to a different node. The result
    is:

    runtime=30m/processes=32/total test size=128G/each process mmap region=4G
    kernel          throughput
    vanilla         13306
    auto-binding    15169     +14%

    runtime=30m/processes=64/total test size=128G/each process mmap region=2G
    kernel          throughput
    vanilla         11885
    auto-binding    14879     +25%
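
    A sketch of the selection logic this enables in get_swap_pages(), as I
    understand it (simplified; locking details and fallback to other nodes'
    devices are omitted):

        /* One priority list of available swap devices per NUMA node. */
        static struct plist_head *swap_avail_heads;

        int node = numa_node_id();
        struct swap_info_struct *si, *next;

        spin_lock(&swap_avail_lock);
        plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
                                  avail_lists[node]) {
                /* try to allocate slots from si, highest priority first */
        }
        spin_unlock(&swap_avail_lock);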

    [aaron.lu@intel.com: v2]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    [akpm@linux-foundation.org: use kmalloc_array()]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    Signed-off-by: Aaron Lu
    Cc: "Chen, Tim C"
    Cc: Huang Ying
    Cc: Andi Kleen
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
    VMA based swap readahead reads ahead the virtual pages that are
    contiguous in the virtual address space, while the original swap
    readahead reads ahead the swap slots that are contiguous in the swap
    device. Although VMA based swap readahead chooses more appropriate swap
    slots to read ahead, it triggers more small random reads, which may
    cause the performance of an HDD (hard disk) used as swap to degrade
    heavily and may finally outweigh the benefit.

    To avoid this issue, in this patch, if an HDD is used as swap, VMA based
    swap readahead is disabled and the original swap readahead is used
    instead.

    Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    After adding swap-out support for THP (Transparent Huge Page), it is
    possible that a THP in the swap cache (partly swapped out) needs to be
    split. To split such a THP, the swap cluster backing the THP needs to be
    split too, that is, the CLUSTER_FLAG_HUGE flag needs to be cleared for
    the swap cluster. This patch implements that.

    And because writing out a THP requires that it stays a huge page during
    writing, the PageWriteback flag is checked before splitting.

    Link: http://lkml.kernel.org/r/20170724051840.2309-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    It's hard to write a whole transparent huge page (THP) to a file backed
    swap device during swap-out, and file backed swap devices aren't very
    popular. So huge cluster allocation for file backed swap devices is
    disabled.

    Link: http://lkml.kernel.org/r/20170724051840.2309-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    After adding support for delaying THP (Transparent Huge Page) splitting
    until after swap-out, it is possible that some page table mappings of
    the THP are turned into swap entries. So reuse_swap_page() needs to
    check the swap count in addition to the map count, as before. This patch
    does that.

    In the huge PMD write protect fault handler, in addition to the page map
    count, the swap count needs to be checked too, so the page lock needs to
    be acquired as well when calling reuse_swap_page(), in addition to the
    page table lock.

    [ying.huang@intel.com: silence a compiler warning]
    Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    Normal swap slot reclaiming can be done when the swap count reaches
    SWAP_HAS_CACHE. But for a swap slot which is backing a THP, all swap
    slots backing the THP must be reclaimed together, because a swap slot
    may be used again when the THP is swapped out again later. So the swap
    slots backing one THP can be reclaimed together only when the swap count
    for all of them has reached SWAP_HAS_CACHE. In this patch, the functions
    to check whether the swap count for all swap slots backing one THP has
    reached SWAP_HAS_CACHE are implemented and used when checking whether a
    swap slot can be reclaimed.

    To make it easier to determine whether a swap slot is backing a THP, a
    new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
    cluster which is backing a THP (Transparent Huge Page). Because swapping
    a THP in as a whole isn't supported yet, after the THP is deleted from
    the swap cache (for example, when swap-out has finished), the
    CLUSTER_FLAG_HUGE flag will be cleared, so that the normal pages inside
    the THP can be swapped in individually.

    [ying.huang@intel.com: fix swap_page_trans_huge_swapped on HDD]
    Link: http://lkml.kernel.org/r/874ltsm0bi.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20170724051840.2309-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "mm, THP, swap: Delay splitting THP after swapped out", v3.

    This is the second step of THP (Transparent Huge Page) swap
    optimization. In the first step, the splitting huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache. In the second
    step, the splitting is delayed further to after the swapping out
    finished. The plan is to delay splitting THP step by step, finally
    avoid splitting THP for the THP swapping out and swap out/in the THP as
    a whole.

    In the patchset, more operations for anonymous THP reclaim, such as TLB
    flushing, writing the THP to the swap device, and removing the THP from
    the swap cache, are batched, so that the performance of anonymous THP
    swap-out is improved.

    During the development, the following scenarios/code paths have been
    checked,

    - swap out/in
    - swap off
    - write protect page fault
    - madvise_free
    - process exit
    - split huge page

    With the patchset, the swap out throughput improves 42% (from about
    5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
    with 16 processes. At the same time, the IPI count (reflecting TLB
    flushing) is reduced by about 78.9%. The test is done on a Xeon E5 v3
    system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    Below is the part of the cover letter for the first step patchset of THP
    swap optimization which applies to all steps.

    =========================

    Recently, the performance of storage devices has improved so fast that
    we cannot saturate the disk bandwidth with a single logical CPU when
    doing page swap-out, even on a high-end server machine, because the
    performance of the storage devices has improved faster than that of a
    single logical CPU. And it seems that the trend will not change in the
    near future. On the other hand, THP becomes more and more popular
    because of increased memory size. So it becomes necessary to optimize
    THP swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce TLB flushing and lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for the swap read, which are usually 4k random
    IO. This will improve the performance of the THP swap too.

    - It will help with memory fragmentation, especially when THP is
    heavily used by the applications. The 2M contiguous pages will be
    freed up after the THP is swapped out.

    - It will improve the THP utilization on the system with the swap
    turned on. Because the speed for khugepaged to collapse the normal
    pages into the THP is quite slow. After the THP is split during the
    swapping out, it will take quite long time for the normal pages to
    collapse back into the THP after being swapped in. The high THP
    utilization helps the efficiency of the page based memory management
    too.

    There are some concerns regarding THP swap in, mainly because possible
    enlarged read/write IO size (for swap in/out) may put more overhead on
    the storage device. To deal with that, the THP swap in should be turned
    on only when necessary.

    For example, it can be selected via "always/never/madvise" logic, to be
    turned on globally, turned off globally, or turned on only for VMA with
    MADV_HUGEPAGE, etc.

    This patch (of 12):

    Previously, swapcache_free_cluster() was used only in the error path of
    shrink_page_list() to free the swap cluster just allocated if the THP
    (Transparent Huge Page) failed to be split. In this patch, it is
    enhanced to clear the swap cache flag (SWAP_HAS_CACHE) for the swap
    cluster that holds the contents of a THP swapped out.

    This will be used when delaying THP splitting until after swap-out.
    Because there is no support for swapping a THP in as a whole yet, after
    clearing the swap cache flag, the swap cluster backing the THP swapped
    out will be split, so that the swap slots in the swap cluster can be
    swapped in as normal pages later.

    Link: http://lkml.kernel.org/r/20170724051840.2309-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

11 Jul, 2017

1 commit

  • For fast flash disk, async IO could introduce overhead because of
    context switch. block-mq now supports IO poll, which improves
    performance and latency a lot. swapin is a good place to use this
    technique, because the task is waiting for the swapin page to continue
    execution.

    In my virtual machine, directly reading 4k data from an NVMe with
    iopoll is about 60% faster than without poll. With iopoll support in the
    swapin patch, my microbenchmark (a task doing random memory writes) is
    about 10%~25% faster. CPU utilization increases a lot though, 2x and
    even 3x CPU utilization. This will depend on disk speed.

    While iopoll in swapin isn't intended for all use cases, it's a win for
    latency sensitive workloads with a high speed swap disk. The block layer
    has a knob to control polling at runtime. If poll isn't enabled in the
    block layer, there should be no noticeable change in swapin.

    I got a chance to run the same test in a NVMe with DRAM as the media.
    In simple fio IO test, blkpoll boosts 50% performance in single thread
    test and ~20% in 8 threads test. So this is the base line. In above
    swap test, blkpoll boosts ~27% performance in single thread test.
    blkpoll uses 2x CPU time though.

    If we enable hybrid polling, the performance gain drops very slightly,
    but the CPU time is only 50% worse than without blkpoll. We can also
    adjust the hybrid poll parameter, which reduces the CPU time penalty
    further. In the 8 threads test, blkpoll doesn't help though: the
    performance is similar to that without blkpoll, but so is the CPU
    utilization. There is lock contention in the swap path, and the CPU time
    spent on blkpoll isn't high. So overall, blkpoll swapin isn't worse than
    without it.

    The swapin readahead might read several pages in at the same time and
    form a big IO request. Since that IO will take a longer time, it doesn't
    make sense to poll, so the patch only does iopoll for single page
    swapin.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
    Signed-off-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

07 Jul, 2017

3 commits

    To reduce the lock contention on swap_info_struct->lock when freeing
    swap entries, the freed swap entries are first collected in a per-CPU
    buffer and really freed later in a batch. During the batch freeing, if
    the consecutive swap entries in the per-CPU buffer belong to the same
    swap device, swap_info_struct->lock needs to be acquired/released only
    once, so the lock contention can be reduced greatly. But if there are
    multiple swap devices, it is possible that the lock may be unnecessarily
    released/acquired, because swap entries belonging to the same swap
    device may be non-consecutive in the per-CPU buffer.

    To solve the issue, the per-CPU buffer is sorted according to the swap
    device before freeing the swap entries.
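
    A minimal sketch of the sorting step, assuming the batch-free helper
    receives an array of swap entries (the comparator shape is
    illustrative):

        #include <linux/sort.h>

        static int swp_entry_cmp(const void *ent1, const void *ent2)
        {
                const swp_entry_t *e1 = ent1, *e2 = ent2;

                /* group entries of the same swap device together */
                return (int)swp_type(*e1) - (int)swp_type(*e2);
        }

        void swapcache_free_entries(swp_entry_t *entries, int n)
        {
                /* ... */
                if (nr_swapfiles > 1)
                        sort(entries, n, sizeof(entries[0]),
                             swp_entry_cmp, NULL);
                /* then free in order, taking each device's lock only once */
        }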

    With the patch, the memory (some swapped out) free time reduced 11.6%
    (from 2.65s to 2.35s) in the vm-scalability swap-w-rand test case with
    16 processes. The test is done on a Xeon E5 v3 system. The swap device
    used is a RAM simulated PMEM (persistent memory) device. To test
    swapping, the test case creates 16 processes, which allocate and write
    to the anonymous pages until the RAM and part of the swap device is used
    up, finally the memory (some swapped out) is freed before exit.

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/20170525005916.25249-1-ying.huang@intel.com
    Signed-off-by: Huang Ying
    Acked-by: Tim Chen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    Now that get_swap_page() takes a struct page and allocates swap space
    according to the page size (i.e. normal or THP), it is cleaner to
    introduce put_swap_page() as the counterpart of get_swap_page(). It then
    calls the right swap slot free function depending on the page's size.

    [ying.huang@intel.com: minor cleanup and fix]
    Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that
    we cannot saturate the disk bandwidth with a single logical CPU when
    doing page swap-out, even on a high-end server machine, because the
    performance of the storage devices has improved faster than that of a
    single logical CPU. And it seems that the trend will not change in the
    near future. On the other hand, THP becomes more and more popular
    because of increased memory size. So it becomes necessary to optimize
    THP swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. This is
    particularly helpful for swap reads, which are usually 4k random IO.
    This will improve the performance of the THP swap too.

    - It will help with memory fragmentation, especially when THP is
    heavily used by applications. The 2M of contiguous pages will be freed
    up after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because the speed at which khugepaged collapses normal pages into THPs
    is quite slow. After a THP is split during swap-out, it takes quite a
    long time for the normal pages to collapse back into a THP after being
    swapped in. High THP utilization also helps the efficiency of
    page-based memory management.

    There are some concerns regarding THP swap-in, mainly because the
    possibly enlarged read/write IO size (for swap in/out) may put more
    overhead on the storage device. To deal with that, THP swap-in should
    be enabled only when necessary. For example, it could be selected via
    "always/never/madvise" logic: turned on globally, turned off globally,
    or turned on only for VMAs with MADV_HUGEPAGE, etc.

    This patchset is the first step of THP swap support. The plan is to
    delay splitting the THP step by step, finally avoiding the split during
    swap-out altogether and swapping the THP out/in as a whole.

    As the first step, in this patchset, splitting the huge page is delayed
    from almost the first step of swap-out to after the swap space has been
    allocated for the THP and the THP has been added into the swap cache.
    This will reduce lock acquiring/releasing for the locks used in swap
    cache management.

    With the patchset, the swap-out throughput improves by 15.5% (from
    about 3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test
    case with 8 processes. The test is done on a Xeon E5 v3 system. The
    swap device used is a RAM simulated PMEM (persistent memory) device.
    To test sequential swap-out, the test case creates 8 processes, which
    sequentially allocate and write to anonymous pages until the RAM and
    part of the swap device are used up.

    This patch (of 5):

    In this patch, splitting the huge page is delayed from almost the first
    step of swap-out to after the swap space has been allocated for the THP
    (Transparent Huge Page) and the THP has been added into the swap cache.
    This batches the corresponding operations, thus improving THP swap-out
    throughput.

    This is the first step of the THP swap optimization. The plan is to
    delay splitting the THP step by step and eventually avoid splitting the
    THP at all.

    In this patch, one swap cluster is used to hold the contents of each
    THP swapped out, so the swap cluster size is changed to that of the THP
    (512 pages) on the x86_64 architecture. Other architectures that want
    this THP swap optimization need to select ARCH_USES_THP_SWAP_CLUSTER in
    their Kconfig file. In effect, this doubles the swap cluster size on
    x86_64, which may make it harder to find a free cluster when the swap
    space becomes fragmented, so in theory it may reduce continuous swap
    space allocation and sequential writes. The performance tests in 0day
    show no regressions caused by this.
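
    The 512 figure follows directly from the page sizes involved; a tiny
    sketch of the arithmetic (assuming x86_64 with 4k base pages and 2M
    THPs, as above):

        #include <stdio.h>

        int main(void)
        {
            unsigned long base_page = 4096;              /* 4k base page */
            unsigned long huge_page = 2UL * 1024 * 1024; /* 2M THP */

            /* One cluster must hold a whole THP: 2M / 4k = 512 entries,
             * i.e. twice the earlier cluster size the text refers to. */
            printf("entries per cluster: %lu\n", huge_page / base_page);
            return 0;
        }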

    In future THP swap optimization, some information about the swapped-out
    THP (such as the compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support
    charging or uncharging a swap cluster backing a THP as a whole.

    Swap cluster allocate/free functions are added to allocate/free a swap
    cluster for a THP. A fairly simple algorithm is used for swap cluster
    allocation: only the first swap device in the priority list is tried.
    If that fails, the caller falls back to allocating a single swap slot
    instead. This works well enough for normal cases, but if the numbers
    of free swap clusters differ significantly among multiple swap devices
    (for example, because of a big size difference among the devices), it
    is possible that some THPs are split earlier than necessary.
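
    A minimal sketch of that allocation policy (the helpers
    try_alloc_cluster_on and alloc_single_slot are invented for
    illustration, not the kernel's real functions): only the
    highest-priority device is tried for a cluster, and failure falls back
    to a single slot.

        #include <stdbool.h>
        #include <stdio.h>

        struct swap_device {
            const char *name;
            int free_clusters;  /* clusters still available on this device */
        };

        /* Hypothetical: try to carve a whole cluster out of one device. */
        static bool try_alloc_cluster_on(struct swap_device *dev)
        {
            if (dev->free_clusters > 0) {
                dev->free_clusters--;
                printf("allocated a cluster on %s\n", dev->name);
                return true;
            }
            return false;
        }

        static void alloc_single_slot(void)
        {
            /* Caller's fallback: the THP is split and uses normal slots. */
            printf("fell back to a single swap slot (THP will be split)\n");
        }

        /* Only the first device in the priority list is tried for a cluster. */
        static void alloc_swap_for_thp(struct swap_device *prio_list, int ndev)
        {
            if (ndev > 0 && try_alloc_cluster_on(&prio_list[0]))
                return;
            alloc_single_slot();
        }

        int main(void)
        {
            struct swap_device devs[] = {
                { "fast-pmem", 1 },     /* highest priority */
                { "slow-disk", 100 },
            };

            alloc_swap_for_thp(devs, 2);  /* uses the cluster on fast-pmem */
            alloc_swap_for_thp(devs, 2);  /* fast-pmem is empty: fall back */
            return 0;
        }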

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may
    be enhanced in the future with a multi-order radix tree, but because we
    will split the THP soon during swap-out, that optimization doesn't make
    much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swap-out. The page lock is held while allocating
    the swap cluster, adding the THP into the swap cache, and splitting the
    THP, so in code paths other than swap-out, if the THP needs to be
    split, PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSD, so the THP swap
    optimization in this patchset has no effect on HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

09 May, 2017

1 commit

  • Currently, vzalloc() is used in swap code to allocate various data
    structures, such as the swap cache, swap slots cache, cluster info,
    etc., because the sizes may be too large on some systems for a normal
    kzalloc() to succeed. But kzalloc() has some advantages, for example
    less memory fragmentation and less TLB pressure. So change the data
    structure allocation in swap code to use kvzalloc(), which tries
    kzalloc() first and falls back to vzalloc() if kzalloc() fails.
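
    A rough userspace analogue of that try-contiguous-then-fallback pattern
    (not the kernel implementation; the fake_* helpers below merely model
    kzalloc()/vzalloc() behaviour with calloc()):

        #include <stdio.h>
        #include <stdlib.h>

        /* Stand-in for kzalloc(): wants physically contiguous memory, so
         * pretend that large requests fail. */
        static void *fake_kzalloc(size_t size)
        {
            if (size > 64 * 1024)
                return NULL;
            return calloc(1, size);
        }

        /* Stand-in for vzalloc(): builds a virtually contiguous mapping
         * from scattered pages, so large sizes are easier to satisfy. */
        static void *fake_vzalloc(size_t size)
        {
            return calloc(1, size);
        }

        /* kvzalloc-style helper: prefer the contiguous allocation and fall
         * back to the scattered one only when it fails. */
        static void *fake_kvzalloc(size_t size)
        {
            void *p = fake_kzalloc(size);

            if (p) {
                printf("%zu bytes: contiguous allocation succeeded\n", size);
                return p;
            }
            printf("%zu bytes: falling back to scattered allocation\n", size);
            return fake_vzalloc(size);
        }

        int main(void)
        {
            void *small = fake_kvzalloc(4 * 1024);
            void *big   = fake_kvzalloc(512 * 1024);

            free(small);
            free(big);
            return 0;
        }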

    In general, although kmalloc() will reduce the number of available
    high-order pages in the short term, vmalloc() will cause more pain for
    memory fragmentation in the long term. And the swap data structure
    allocations changed in this patch are expected to be long-term
    allocations.

    From Dave Hansen:
    "for example, we have a two-page data structure. vmalloc() takes two
    effectively random order-0 pages, probably from two different 2M pages
    and pins them. That "kills" two 2M pages. kmalloc(), allocating two
    *contiguous* pages, will not cross a 2M boundary. That means it will
    only "kill" the possibility of a single 2M page. More 2M pages == less
    fragmentation."

    The allocation in this patch occurs at swapon time, which usually
    happens during system boot, so we usually have a good chance of
    allocating the contiguous pages successfully.

    The allocation for swap_map[] in struct swap_info_struct is not changed,
    because that is usually quite large and vmalloc_to_page() is used for
    it. That makes it a little harder to change.

    Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
    Signed-off-by: Huang Ying
    Acked-by: Tim Chen
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying