14 Oct, 2020

1 commit


08 Aug, 2020

3 commits

  • Because swap_slot_cache_enabled can only become true in
    enable_swap_slots_cache(), and only after swap_slot_cache_initialized
    is already true, swap_slot_cache_initialized is always true whenever
    swap_slot_cache_enabled is true.

    So the condition:
    "swap_slot_cache_enabled && swap_slot_cache_initialized"
    can be reduced to "swap_slot_cache_enabled"

    And by De Morgan's law:
    "!swap_slot_cache_enabled || !swap_slot_cache_initialized"
    is equivalent to "!(swap_slot_cache_enabled && swap_slot_cache_initialized)"

    So no functional change.

    Signed-off-by: Zhen Lei
    Signed-off-by: Andrew Morton
    Acked-by: Tim Chen
    Link: http://lkml.kernel.org/r/20200430061143.450-4-thunder.leizhen@huawei.com
    Signed-off-by: Linus Torvalds

    Zhen Lei
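
    The reduction argued in the commit above can be checked with a small
    userspace sketch. The flag names are taken from the message;
    mark_initialized(), try_enable(), old_check() and new_check() are
    hypothetical helpers for illustration, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the two flags named in the commit message. */
static bool swap_slot_cache_initialized;
static bool swap_slot_cache_enabled;

/* Hypothetical: mark the cache as set up. */
void mark_initialized(void)
{
	swap_slot_cache_initialized = true;
}

/* Hypothetical: the only place "enabled" becomes true, and only once
 * "initialized" is already true. */
void try_enable(void)
{
	if (swap_slot_cache_initialized)
		swap_slot_cache_enabled = true;
}

/* Condition as written before the patch. */
bool old_check(void)
{
	return swap_slot_cache_enabled && swap_slot_cache_initialized;
}

/* Reduced condition after the patch. */
bool new_check(void)
{
	/* Invariant the reduction relies on: enabled implies initialized. */
	assert(!swap_slot_cache_enabled || swap_slot_cache_initialized);
	return swap_slot_cache_enabled;
}
```

    As long as try_enable() is the only writer that sets the enabled flag,
    old_check() and new_check() agree in every reachable state.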
     
  • Whether swap_slot_cache_initialized is true or false,
    __reenable_swap_slots_cache() is always called. To make this
    clear, leave only one call to __reenable_swap_slots_cache(). This also
    makes it clearer what extra work needs to be done when
    swap_slot_cache_initialized is false.

    No functional change.

    Signed-off-by: Zhen Lei
    Signed-off-by: Andrew Morton
    Acked-by: Tim Chen
    Link: http://lkml.kernel.org/r/20200430061143.450-3-thunder.leizhen@huawei.com
    Signed-off-by: Linus Torvalds

    Zhen Lei
     
  • Patch series "clean up some functions in mm/swap_slots.c".

    When I studied the code of mm/swap_slots.c, I found some places that
    can be improved.

    This patch (of 3):

    Both "slots" and "slots_ret" only need to be freed when the cache has
    already been allocated. Put the two frees next to each other, which
    seems clearer.

    No functional change.

    Signed-off-by: Zhen Lei
    Signed-off-by: Andrew Morton
    Acked-by: Tim Chen
    Link: http://lkml.kernel.org/r/20200430061143.450-1-thunder.leizhen@huawei.com
    Link: http://lkml.kernel.org/r/20200430061143.450-2-thunder.leizhen@huawei.com
    Signed-off-by: Linus Torvalds

    Zhen Lei
     

03 Apr, 2020

1 commit

  • Currently we use a tmp pointer, pentry, to transfer and reset the swap
    cache slot, which is a little redundant. The swap cache slot stores the
    entry value directly, so assigning and resetting it by value is more
    straightforward.

    Also this patch merges the else and if, since this is the only case in
    which we refill the swap cache and repeat.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: Tim Chen
    Link: http://lkml.kernel.org/r/20200311055352.50574-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
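
    A minimal sketch of taking a slot by value, as the commit above
    describes. The struct layout, SLOTS_PER_CACHE and cache_take() are
    assumptions made for illustration; the refill-and-repeat path is left
    out.

```c
/* Simplified stand-in for swp_entry_t; val == 0 means "empty". */
typedef struct { unsigned long val; } swp_entry_t;

#define SLOTS_PER_CACHE 4	/* illustrative size, not the kernel's */

struct slots_cache {
	swp_entry_t slots[SLOTS_PER_CACHE];
	int cur;	/* index of the next slot to hand out */
	int nr;		/* number of slots still available */
};

/*
 * Take one entry out of the cache. The slot is copied and then cleared
 * by value, so no temporary pointer into the array is needed.
 */
swp_entry_t cache_take(struct slots_cache *cache)
{
	swp_entry_t entry = { 0 };

	if (cache->nr > 0) {
		entry = cache->slots[cache->cur];
		cache->slots[cache->cur++].val = 0;
		cache->nr--;
	}
	return entry;
}
```

    An empty cache returns an entry with val == 0; in this sketch that is
    the point where a caller would refill the cache and retry.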
     

23 Aug, 2018

1 commit

  • As suggested by Matthew Wilcox, it is better to use "int entry_size"
    instead of "bool cluster" as the parameter that specifies whether to
    operate on huge or normal swap entries, because this improves the
    flexibility to support other swap entry sizes. Dave Hansen thinks that
    this improves code readability too.

    So in this patch, the "bool cluster" parameter of get_swap_pages() is
    replaced by "int entry_size".

    The nr_swap_entries() trick is used to reduce the binary size when
    !CONFIG_TRANSPARENT_HUGEPAGE.

             text    data     bss     dec     hex filename
    base    24215    2028     340   26583    67d7 mm/swapfile.o
    head    24123    2004     340   26467    6763 mm/swapfile.o

    Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Matthew Wilcox
    Acked-by: Dave Hansen
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
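
    The binary-size effect mentioned in the commit above can be
    illustrated with a compile-time constant. This is only a sketch of the
    general idea: THP_SWAP_SKETCH, nr_entries(), HUGE_ENTRY_NR and
    slots_needed() are hypothetical names, and the actual helper in the
    patch may look different.

```c
/*
 * THP_SWAP_SKETCH stands in for the kernel config option. When it is not
 * defined, nr_entries() is the constant 1, so the compiler can drop the
 * huge-entry branch below entirely.
 */
#ifdef THP_SWAP_SKETCH
#define nr_entries(size)	(size)
#else
#define nr_entries(size)	1
#endif

#define HUGE_ENTRY_NR	512	/* slots per huge swap entry (assumed) */

/* Return how many slots a request for n_goal entries would reserve. */
int slots_needed(int n_goal, int entry_size)
{
	int size = nr_entries(entry_size);

	if (size == HUGE_ENTRY_NR)
		return n_goal * HUGE_ENTRY_NR;	/* dead code when size is 1 */
	return n_goal;
}
```

    Built without THP_SWAP_SKETCH, every request is treated as a
    normal-sized entry and the multiplication never appears in the object
    code.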
     

18 Aug, 2018

1 commit

  • The mutexes swap_slots_cache_mutex and swap_slots_cache_enable_mutex are
    local to the source and do not need to be in global scope, so make them
    static.

    Cleans up sparse warnings:
    symbol 'swap_slots_cache_mutex' was not declared. Should it be static?
    symbol 'swap_slots_cache_enable_mutex' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180624182536.4937-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     

13 Jun, 2018

1 commit

  • The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
    patch replaces cases of:

    kvzalloc(a * b, gfp)

    with:

    kvcalloc(a, b, gfp)

    as well as handling cases of:

    kvzalloc(a * b * c, gfp)

    with:

    kvzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kvcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kvzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kvzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kvzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kvzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kvzalloc
    + kvcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kvzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kvzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(sizeof(THING) * C2, ...)
    |
    kvzalloc(sizeof(TYPE) * C2, ...)
    |
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(C1 * C2, ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
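
    The motivation for the two-factor and array3_size() forms in the
    commit above is that the multiplication is checked for overflow
    instead of silently wrapping. A userspace sketch of that idea follows;
    checked_calloc() and checked_size3() are hypothetical stand-ins built
    on calloc(), not the kernel helpers themselves.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Two-factor zeroing allocation: refuse the request when count * size
 * would wrap around, instead of allocating a buffer that is too short.
 */
void *checked_calloc(size_t count, size_t size)
{
	if (size != 0 && count > SIZE_MAX / size)
		return NULL;	/* product would overflow size_t */
	return calloc(count, size);
}

/* Three-factor product that saturates at SIZE_MAX on overflow. */
size_t checked_size3(size_t a, size_t b, size_t c)
{
	if (b != 0 && a > SIZE_MAX / b)
		return SIZE_MAX;
	if (c != 0 && a * b > SIZE_MAX / c)
		return SIZE_MAX;
	return a * b * c;
}
```

    A saturated size of SIZE_MAX is expected to make the subsequent
    allocation fail, which is safer than a wrapped, too-small size.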
     

08 Jun, 2018

1 commit

  • Patch series "mm, memcontrol: Implement memory.swap.events", v2.

    This patchset implements memory.swap.events which contains max and fail
    events so that userland can monitor and respond to swap running out.

    This patch (of 2):

    get_swap_page() is always followed by mem_cgroup_try_charge_swap().
    This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and
    makes get_swap_page() call the function even after swap allocation
    failure.

    This simplifies the callers and consolidates memcg related logic and
    will ease adding swap related memcg events.

    Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

06 Apr, 2018

1 commit

  • For mm/swap_slots.c, use the traditional Linux method of conditional
    compilation and linking instead of always compiling it by using #ifdef
    CONFIG_SWAP and #endif for the entire source file (excluding header
    files).

    Link: http://lkml.kernel.org/r/c2a47015-0b5a-d0d9-8bc7-9984c049df20@infradead.org
    Signed-off-by: Randy Dunlap
    Acked-by: Tim Chen
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

16 Nov, 2017

1 commit

  • Memory allocations can happen before the swap_slots cache initialization
    is completed during cpu bring up. If we are low on memory, we could
    call get_swap_page() and access swap_slots_cache before it is fully
    initialized.

    Add a check in get_swap_page() for an initialized swap_slots_cache to
    prevent this condition. A similar check already exists in
    free_swap_slot(). Also annotate the checks to indicate the likely
    condition.

    We also added a memory barrier to make sure that the lock
    initialization is done before the assignment of the cache->slots and
    cache->slots_ret pointers. This ensures the assumption that it is safe
    to acquire the slots cache locks and use the slots cache when the
    corresponding cache->slots or cache->slots_ret pointers are non-NULL.

    [akpm@linux-foundation.org: tidy up comment]
    [akpm@linux-foundation.org: fix spello in comment]
    Link: http://lkml.kernel.org/r/65a9d0f133f63e66bba37b53b2fd0464b7cae771.1500677066.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Reported-by: Wenwei Tao
    Acked-by: Ying Huang
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
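
    The ordering requirement in the commit above (finish lock
    initialization before publishing the pointer) can be sketched in
    userspace with C11 atomics. Release/acquire ordering is used here in
    place of the kernel's memory barrier, and struct cache_sketch,
    cache_publish() and cache_usable() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct cache_sketch {
	bool lock_ready;	/* stands in for mutex/spinlock initialization */
	_Atomic(int *) slots;	/* NULL until the cache is usable */
};

/* Initialize the lock state first, then publish the slots pointer. */
void cache_publish(struct cache_sketch *c, int *slots)
{
	c->lock_ready = true;
	/* Release: the store above becomes visible before the pointer. */
	atomic_store_explicit(&c->slots, slots, memory_order_release);
}

/* A reader that sees a non-NULL pointer may rely on the lock state. */
bool cache_usable(struct cache_sketch *c)
{
	int *slots = atomic_load_explicit(&c->slots, memory_order_acquire);

	if (!slots)
		return false;
	return c->lock_ready;
}
```

    With this pairing, a non-NULL slots pointer implies that everything
    written before the publishing store is visible to the reader.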
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    The criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

11 Jul, 2017

1 commit

  • get_cpu_var() disables preemption and returns the per-CPU version of the
    variable. Disabling preemption is useful to ensure atomic access to the
    variable within the critical section.

    In this case however, after the per-CPU version of the variable is
    obtained the ->free_lock is acquired. For that reason it seems the raw
    accessor could be used. It only seems that ->slots_ret should be
    retested (because with disabled preemption this variable can not be set
    to NULL otherwise).

    This popped up during PREEMPT-RT testing because it tries to take
    spinlocks in a preempt disabled section. In RT, spinlocks can sleep.

    Link: http://lkml.kernel.org/r/20170623114755.2ebxdysacvgxzott@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Michal Hocko
    Cc: Tim Chen
    Cc: Thomas Gleixner
    Cc: Ying Huang
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

07 Jul, 2017

1 commit

  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that
    we cannot saturate the disk bandwidth with a single logical CPU when
    doing page swap out, even on a high-end server machine, because the
    performance of storage devices has improved faster than that of a
    single logical CPU. It seems that this trend will not change in the
    near future. On the other hand, THP is becoming more and more popular
    because of increased memory sizes. So it becomes necessary to optimize
    THP swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. This is
    particularly helpful for swap reads, which are usually 4k random
    IO. This will improve the performance of the THP swap too.

    - It will help reduce memory fragmentation, especially when the THP is
    heavily used by the applications. The 2M of contiguous pages will be
    freed up after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because khugepaged is quite slow to collapse normal pages into a THP.
    After a THP is split during swapping out, it takes quite a long time
    for the normal pages to collapse back into a THP after being swapped
    in. High THP utilization also helps the efficiency of page based
    memory management.

    There are some concerns regarding THP swap in, mainly because the
    possibly enlarged read/write IO size (for swap in/out) may put more
    overhead on the storage device. To deal with that, THP swap in should
    be turned on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.

    This patchset is the first step for the THP swap support. The plan is
    to delay splitting the THP step by step, and finally to avoid splitting
    the THP during swap out and to swap the THP out/in as a whole.

    As the first step, in this patchset, splitting the huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache. This will
    reduce lock acquiring/releasing for the locks used for the swap cache
    management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, splitting the huge page is delayed from almost the first
    step of swapping out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding the THP into the swap cache. This
    batches the corresponding operations and thus improves THP swap out
    throughput.

    This is the first step for the THP swap optimization. The plan is to
    delay splitting the THP step by step and, finally, to avoid splitting
    the THP altogether.

    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out. So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on x86_64 architecture (512). For other
    architectures which want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
    the architecture. In effect, this doubles the swap cluster size on
    x86_64, which may make it harder to find a free cluster when the swap
    space becomes fragmented. In theory, this may therefore reduce
    contiguous swap space allocation and sequential writes. The
    performance test in 0day shows no regressions caused by this.

    In the future of THP swap optimization, some information of the swapped
    out THP (such as compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support
    charging or uncharging a swap cluster backing a THP as a whole.

    The swap cluster allocate/free functions are added to allocate/free a
    swap cluster for a THP. A fairly simple algorithm is used for swap
    cluster allocation: only the first swap device in the priority list
    is tried when allocating the swap cluster. The function fails if
    that attempt is not successful, and the caller falls back to allocating
    a single swap slot instead. This works well enough for normal cases. If
    the difference in the number of free swap clusters among multiple
    swap devices is significant, it is possible that some THPs are split
    earlier than necessary. For example, this could be caused by a big size
    difference among multiple swap devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be
    enhanced in the future with a multi-order radix tree. But because we
    will split the THP soon during swapping out, that optimization doesn't
    make much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swapping out. The page lock will be held while
    allocating the swap cluster, adding the THP into the swap cache and
    splitting the THP. So in code paths other than swapping out, if the THP
    needs to be split, PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
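
    The allocation policy described in the commit above (try a whole
    cluster on the first device only, otherwise fall back to a single
    slot) can be sketched as follows. struct swap_dev_sketch,
    alloc_cluster() and reserve_for_huge_page() are hypothetical; the 512
    slots per cluster come from the x86_64 figure in the message.

```c
#include <stdbool.h>

#define CLUSTER_SLOTS 512	/* slots per cluster, as on x86_64 in the text */

struct swap_dev_sketch {
	int free_clusters;	/* whole free clusters left */
	int free_slots;		/* individual free slots left */
};

/* Try to reserve one whole cluster on the first device only. */
bool alloc_cluster(struct swap_dev_sketch *first)
{
	if (first->free_clusters == 0)
		return false;
	first->free_clusters--;
	first->free_slots -= CLUSTER_SLOTS;
	return true;
}

/*
 * Returns the number of slots reserved for a huge page: a full cluster
 * when one is free, otherwise a single slot (the caller would then split
 * the page), or 0 when the device is full.
 */
int reserve_for_huge_page(struct swap_dev_sketch *first)
{
	if (alloc_cluster(first))
		return CLUSTER_SLOTS;
	if (first->free_slots > 0) {
		first->free_slots--;
		return 1;
	}
	return 0;
}
```

    Because only the first device is consulted, a second device with
    plenty of free clusters would not prevent the early split in this
    sketch, which mirrors the limitation the message points out.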
     

09 May, 2017

1 commit

  • Now vzalloc() is used in swap code to allocate various data structures,
    such as swap cache, swap slots cache, cluster info, etc., because the
    size may be too large on some systems, so that a normal kzalloc() may
    fail. But using kzalloc() has some advantages, for example, less memory
    fragmentation, less TLB pressure, etc. So change the data structure
    allocation in swap code to use kvzalloc(), which will try kzalloc()
    first and fall back to vzalloc() if kzalloc() fails.

    In general, although kmalloc() will reduce the number of high-order
    pages in the short term, vmalloc() will cause more pain for memory
    fragmentation in the long term. And the swap data structure allocations
    that are changed in this patch are expected to be long term allocations.

    From Dave Hansen:
    "for example, we have a two-page data structure. vmalloc() takes two
    effectively random order-0 pages, probably from two different 2M pages
    and pins them. That "kills" two 2M pages. kmalloc(), allocating two
    *contiguous* pages, will not cross a 2M boundary. That means it will
    only "kill" the possibility of a single 2M page. More 2M pages == less
    fragmentation."

    The allocation in this patch occurs at swapon time, which is
    usually during system boot, so we usually have a good chance of
    allocating the contiguous pages successfully.

    The allocation for swap_map[] in struct swap_info_struct is not changed,
    because that is usually quite large and vmalloc_to_page() is used for
    it. That makes it a little harder to change.

    Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
    Signed-off-by: Huang Ying
    Acked-by: Tim Chen
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
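
    The try-then-fall-back behaviour that the commit above attributes to
    kvzalloc() can be sketched in userspace. All names here
    (CONTIG_LIMIT, contig_zalloc(), fallback_zalloc(), try_both_zalloc())
    are hypothetical stand-ins; the size limit only simulates a contiguous
    allocation failing for large requests.

```c
#include <stdlib.h>

/* Simulated limit above which the "contiguous" allocator fails. */
#define CONTIG_LIMIT 4096

enum alloc_kind { ALLOC_NONE, ALLOC_CONTIG, ALLOC_FALLBACK };

/* Stand-in for the physically contiguous, zeroing allocator. */
void *contig_zalloc(size_t size)
{
	return size <= CONTIG_LIMIT ? calloc(1, size) : NULL;
}

/* Stand-in for the virtually contiguous, zeroing allocator. */
void *fallback_zalloc(size_t size)
{
	return calloc(1, size);
}

/* Try the contiguous allocator first; fall back only if it fails. */
void *try_both_zalloc(size_t size, enum alloc_kind *kind)
{
	void *p = contig_zalloc(size);

	*kind = ALLOC_CONTIG;
	if (!p) {
		p = fallback_zalloc(size);
		*kind = p ? ALLOC_FALLBACK : ALLOC_NONE;
	}
	return p;
}
```

    Small requests are served by the first allocator and only oversized
    ones reach the fallback, which is the trade-off the message argues for.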
     

04 May, 2017

1 commit


22 Mar, 2017

1 commit

  • Before commit 452b94b8c8c7 ("mm/swap: don't BUG_ON() due to
    uninitialized swap slot cache"), the following bug was reported:

    ------------[ cut here ]------------
    kernel BUG at mm/swap_slots.c:270!
    invalid opcode: 0000 [#1] SMP
    CPU: 5 PID: 1745 Comm: (sd-pam) Not tainted 4.11.0-rc1-00243-g24c534bb161b #1
    Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
    RIP: 0010:free_swap_slot+0xba/0xd0
    Call Trace:
    swap_free+0x36/0x40
    do_swap_page+0x360/0x6d0
    __handle_mm_fault+0x880/0x1080
    handle_mm_fault+0xd0/0x240
    __do_page_fault+0x232/0x4d0
    do_page_fault+0x20/0x70
    page_fault+0x22/0x30
    ---[ end trace aefc9ede53e0ab21 ]---

    This is raised by the BUG_ON(!swap_slot_cache_initialized) in
    free_swap_slot(). This is incorrect, because even if the swap slots
    cache fails to be initialized, swap should operate properly without
    the swap slots cache. The use_swap_slot_cache check later in the
    function handles the uninitialized swap slots cache case.

    In commit 452b94b8c8c7, the BUG_ON() was replaced by WARN_ON_ONCE(). In
    this patch, the WARN_ON_ONCE() is removed too.

    Reported-by: Linus Torvalds
    Acked-by: Tim Chen
    Cc: Michal Hocko
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Linus Torvalds

    Huang Ying
     

20 Mar, 2017

1 commit

  • This BUG_ON() triggered for me once at shutdown, and I don't see a
    reason for the check. The code correctly checks whether the swap slot
    cache is usable or not, so an uninitialized swap slot cache is not
    actually problematic afaik.

    I've temporarily just switched the BUG_ON() to a WARN_ON_ONCE(), since
    I'm not sure why that seemingly pointless check was there. I suspect
    the real fix is to just remove it entirely, but for now we'll warn about
    it but not bring the machine down.

    Cc: "Huang, Ying"
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Feb, 2017

2 commits

  • During swap off, a swap entry may have swap_map[] ==
    SWAP_HAS_CACHE (for example, just allocated). If we return NULL in
    __read_swap_cache_async(), the swap off will abort. So when the swap
    slot cache is disabled (for swap off), we wait for the page to be put
    into the swap cache in such a race condition. This should not be a
    problem for the swap slot cache, because the swap slot cache should be
    drained after clearing swap_slot_cache_enabled.

    [ying.huang@intel.com: fix memory leak in __read_swap_cache_async()]
    Link: http://lkml.kernel.org/r/874lzt6znd.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/5e2c5f6abe8e6eb0797408897b1bba80938e9b9d.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Tim Chen
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • We add per cpu caches for swap slots that can be allocated and freed
    quickly without the need to touch the swap info lock.

    Two separate caches are maintained for swap slots allocated and swap
    slots returned. This is to allow the swap slots to be returned to the
    global pool in a batch so they will have a chance to be coalesced with
    other slots in a cluster. We do not reuse the slots that are returned
    right away, as it may increase fragmentation of the slots.

    The swap allocation cache is protected by a mutex as we may sleep when
    searching for empty slots in cache. The swap free cache is protected by
    a spin lock as we cannot sleep in the free path.

    We refill the swap slots cache when we run out of slots, and we disable
    the swap slots cache and drain the slots if the global number of slots
    falls below a low watermark threshold. We re-enable the cache again when
    the slots available are above a high watermark.

    [ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access]
    [tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h]
    Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com
    Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Michal Hocko
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
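
    The low/high watermark behaviour described in the commit above amounts
    to a hysteresis on the enabled flag. A minimal sketch follows; the
    threshold values, struct slots_cache_state and update_cache_state()
    are assumptions for illustration, since the message gives no numbers.

```c
#include <stdbool.h>

/* Illustrative thresholds; the real values are not given in the text. */
#define LOW_WATERMARK	100
#define HIGH_WATERMARK	200

struct slots_cache_state {
	bool enabled;
};

/*
 * Update the enabled flag from the global number of free slots.
 * Between the two watermarks the previous state is kept (hysteresis).
 * Returns true when the caller should drain the per-CPU caches.
 */
bool update_cache_state(struct slots_cache_state *s, long free_slots)
{
	if (s->enabled && free_slots < LOW_WATERMARK) {
		s->enabled = false;
		return true;
	}
	if (!s->enabled && free_slots > HIGH_WATERMARK)
		s->enabled = true;
	return false;
}
```

    Keeping a gap between the two thresholds avoids toggling the cache on
    and off when the number of free slots hovers around a single limit.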