25 Sep, 2019

2 commits

  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    Kirill and Huang Ying contributed several fixes.
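
    A minimal sketch of the lookup side under this scheme (an editor's
    illustration, not a hunk from the patch; it assumes a helper along the
    lines of mainline's find_subpage()):

    /*
     * With N consecutive slots all pointing at the head page, a lookup
     * recovers the wanted subpage from the head page plus the offset
     * within the compound page.
     */
    static inline struct page *example_find_subpage(struct page *head,
                                                    pgoff_t index)
    {
            return head + (index & (compound_nr(head) - 1));
    }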

    [willy@infradead.org: use compound_nr, squish uninit-var warning]
    Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-by: Song Liu
    Tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Tested-by: Mikhail Gavrilov
    Cc: Hugh Dickins
    Cc: Chris Wilson
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
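
    For reference, the shape of the conversion (illustrative, not an actual
    hunk from the patch):

    /* before */
    nr_pages = 1 << compound_order(page);
    /* after */
    nr_pages = compound_nr(page);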

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Jul, 2019

2 commits

  • total_swapcache_pages() may race with swapper_spaces[] allocation and
    freeing. Previously, this was protected by a swapper_spaces[]-specific
    RCU mechanism. To simplify the logic and reduce code complexity, that is
    replaced with get/put_swap_device(). The line count is reduced too.
    Although not so important, swapoff() performance also improves because
    one synchronize_rcu() call during swapoff() is deleted.

    [ying.huang@intel.com: fix bad swap file entry warning]
    Link: http://lkml.kernel.org/r/20190531024102.21723-1-ying.huang@intel.com
    Link: http://lkml.kernel.org/r/20190527082714.12151-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Tested-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Paul E. McKenney
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Tim Chen
    Cc: Mel Gorman
    Cc: Jérôme Glisse
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Yang Shi
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Jan Kara
    Cc: Dave Jiang
    Cc: Daniel Jordan
    Cc: Andrea Parri
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • When swapin is performed, after getting the swap entry information from
    the page table, the system will swap in the swap entry without any lock
    held to prevent the swap device from being swapped off. This may cause a
    race like the one below:

    CPU 1                           CPU 2
    -----                           -----
    do_swap_page
      swapin_readahead
        __read_swap_cache_async
                                    swapoff
          swapcache_prepare           p->swap_map = NULL
            __swap_duplicate
              p->swap_map[?] /* !!! NULL pointer access */

    Because swapoff is usually done only at system shutdown, the race may
    not hit many people in practice. But it is still a race that needs to
    be fixed.

    To fix the race, get_swap_device() is added to check whether the
    specified swap entry is valid on its swap device. If so, it will keep
    the swap entry valid by preventing the swap device from being swapped
    off, until put_swap_device() is called.

    Because swapoff() is a very rare code path, to make the normal path run
    as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are
    used to implement get/put_swap_device() instead of a reference count.
    From get_swap_device() to put_swap_device(), the RCU read side is
    locked, so synchronize_rcu() in swapoff() will wait until
    put_swap_device() is called.
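
    A minimal sketch of the resulting usage pattern (illustrative; the
    helper names follow the description above, the body is simplified):

    static bool example_use_swap_entry(swp_entry_t entry)
    {
            struct swap_info_struct *si = get_swap_device(entry);

            if (!si)
                    return false;   /* raced with swapoff; entry no longer valid */

            /*
             * RCU read side is held here: swap_map, cluster_info, the swap
             * cache, etc. of this device cannot be freed until
             * put_swap_device() is called.
             */

            put_swap_device(si);
            return true;
    }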

    In addition to the swap_map, cluster_info, etc. data structures in
    struct swap_info_struct, the swap cache radix tree will also be freed
    after swapoff, so this patch fixes the race between swap cache lookup
    and swapoff too.

    Races between some other swap cache usages and swapoff are fixed too, by
    calling synchronize_rcu() between clearing PageSwapCache() and freeing
    the swap cache data structure.

    Another possible method to fix this is to use preempt_off() +
    stop_machine() to prevent the swap device from being swapped off while
    its data structures are being accessed. The overhead in the hot path of
    both methods is similar. The advantages of the RCU based method are:

    1. It avoids stop_machine(), which may disturb the normal execution
    code path on other CPUs.

    2. The file cache uses RCU to protect its radix tree. If a similar
    mechanism is used for the swap cache too, it is easier to share code
    between them.

    3. RCU is already used to protect the swap cache in
    total_swapcache_pages() and exit_swap_address_space(). The two
    mechanisms can be merged to simplify the logic.

    Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
    Fixes: 235b62176712 ("mm/swap: add cluster lock")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Not-nacked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Paul E. McKenney
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Tim Chen
    Cc: Mel Gorman
    Cc: Jérôme Glisse
    Cc: Yang Shi
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Jan Kara
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

06 Jul, 2019

1 commit

  • This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

    Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
    __delete_from_swap_cache() to trigger:

    page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
    anon
    flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
    raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
    raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
    page dumped because: VM_BUG_ON_PAGE(entry != page)
    page->mem_cgroup:ffff978433ace000
    ------------[ cut here ]------------
    kernel BUG at mm/swap_state.c:170!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
    Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
    RIP: 0010:__delete_from_swap_cache+0x20d/0x240
    Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
    RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
    RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
    RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
    R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
    R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
    FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
    Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

    and it's not immediately obvious why it happens. It's too late in the
    rc cycle to do anything but revert for now.

    Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
    Reported-and-bisected-by: Mikhail Gavrilov
    Suggested-by: Jan Kara
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Kirill Shutemov
    Cc: William Kucharski
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 May, 2019

1 commit

  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    [willy@infradead.org: fix swapcache pages]
    Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
    [kirill@shutemov.name: hugetlb stores pages in page cache differently]
    Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
    Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-and-tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Cc: Hugh Dickins
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Mar, 2019

2 commits

  • swap_vma_readahead()'s comment is missing, just add it.

    Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Andrew Morton
    Cc: Huang Ying
    Cc: Tim Chen
    Cc: Minchan Kim
    Cc: Daniel Jordan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Swap readahead would read in a few pages regardless of whether the
    underlying device is busy or not. It may incur long waiting times if the
    device is congested, and it may also exacerbate the congestion.

    Use inode_read_congested() to check whether the underlying device is
    busy, like file page readahead does. Get the inode from
    swap_info_struct.

    Although we could add the inode information to swap_address_space
    (address_space->host), it may lead to unexpected side effects, e.g. it
    may break mapping_cap_account_dirty(). Using the inode from
    swap_info_struct seems simple and good enough.

    The check is only done in swap_cluster_readahead(), since
    swap_vma_readahead() is only used for non-rotational devices, which are
    much less likely to be congested than a traditional HDD.

    Although swap slots may be consecutive on a swap partition, they may
    still be fragmented on a swap file. This check helps reduce excessive
    stalls in that case.
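
    A sketch of the check described above (illustrative; placement and
    surrounding code are approximate):

    static bool example_swap_device_congested(struct swap_info_struct *si)
    {
            /* reach the inode via swap_info_struct, not swap_address_space */
            struct inode *inode = si->swap_file->f_mapping->host;

            return inode_read_congested(inode);
    }

    swap_cluster_readahead() can then skip readahead and fall back to
    reading only the faulting page when this returns true.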

    A test with page_fault1 from will-it-scale (tracing may just show
    runtest.py, which is the wrapper script for page_fault1), which
    basically launches NR_CPU threads and generates 128MB of anonymous
    pages per thread, on my virtual machine with a congested HDD shows that
    the long tail latency is reduced significantly.

    Without the patch
    page_fault1_thr-1490 [023] 129.311706: funcgraph_entry: #57377.796 us | do_swap_page();
    page_fault1_thr-1490 [023] 129.369103: funcgraph_entry: 5.642us | do_swap_page();
    page_fault1_thr-1490 [023] 129.369119: funcgraph_entry: #1289.592 us | do_swap_page();
    page_fault1_thr-1490 [023] 129.370411: funcgraph_entry: 4.957us | do_swap_page();
    page_fault1_thr-1490 [023] 129.370419: funcgraph_entry: 1.940us | do_swap_page();
    page_fault1_thr-1490 [023] 129.378847: funcgraph_entry: #1411.385 us | do_swap_page();
    page_fault1_thr-1490 [023] 129.380262: funcgraph_entry: 3.916us | do_swap_page();
    page_fault1_thr-1490 [023] 129.380275: funcgraph_entry: #4287.751 us | do_swap_page();

    With the patch
    runtest.py-1417 [020] 301.925911: funcgraph_entry: #9870.146 us | do_swap_page();
    runtest.py-1417 [020] 301.935785: funcgraph_entry: 9.802us | do_swap_page();
    runtest.py-1417 [020] 301.935799: funcgraph_entry: 3.551us | do_swap_page();
    runtest.py-1417 [020] 301.935806: funcgraph_entry: 2.142us | do_swap_page();
    runtest.py-1417 [020] 301.935853: funcgraph_entry: 6.938us | do_swap_page();
    runtest.py-1417 [020] 301.935864: funcgraph_entry: 3.765us | do_swap_page();
    runtest.py-1417 [020] 301.935871: funcgraph_entry: 3.600us | do_swap_page();
    runtest.py-1417 [020] 301.935878: funcgraph_entry: 7.202us | do_swap_page();

    [akpm@linux-foundation.org: code cleanup]
    [yang.shi@linux.alibaba.com: add comment]
    Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Tim Chen
    Reviewed-by: Andrew Morton
    Cc: Huang Ying
    Cc: Minchan Kim
    Cc: Daniel Jordan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

27 Oct, 2018

1 commit

  • Refaults happen during transitions between workingsets as well as in-place
    thrashing. Knowing the difference between the two has a range of
    applications, including measuring the impact of memory shortage on the
    system performance, as well as the ability to smarter balance pressure
    between the filesystem cache and the swap-backed workingset.

    During workingset transitions, inactive cache refaults and pushes out
    established active cache. When that active cache isn't stale, however,
    and also ends up refaulting, that's bonafide thrashing.

    Introduce a new page flag that tells on eviction whether the page has been
    active or not in its lifetime. This bit is then stored in the shadow
    entry, to classify refaults as transitioning or thrashing.

    How many page->flags does this leave us with on 32-bit?

    20 bits are always page flags

    21 if you have an MMU

    23 with the zone bits for DMA, Normal, HighMem, Movable

    29 with the sparsemem section bits

    30 if PAE is enabled

    31 with this patch.

    So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
    that's not enough, the system can switch to discontigmem and re-gain the 6
    or 7 sparsemem section bits.

    Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Jun, 2018

1 commit

  • The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
    patch replaces cases of:

    kvzalloc(a * b, gfp)

    with:
    kvcalloc(a, b, gfp)

    as well as handling cases of:

    kvzalloc(a * b * c, gfp)

    with:

    kvzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kvcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kvzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kvzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kvzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kvzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kvzalloc
    + kvcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kvzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kvzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(sizeof(THING) * C2, ...)
    |
    kvzalloc(sizeof(TYPE) * C2, ...)
    |
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(C1 * C2, ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jun, 2018

1 commit

  • Patch series "mm, memcontrol: Implement memory.swap.events", v2.

    This patchset implements memory.swap.events which contains max and fail
    events so that userland can monitor and respond to swap running out.

    This patch (of 2):

    get_swap_page() is always followed by mem_cgroup_try_charge_swap().
    This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and
    makes get_swap_page() call the function even after swap allocation
    failure.

    This simplifies the callers and consolidates memcg related logic and
    will ease adding swap related memcg events.
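
    A sketch of the caller-side simplification (illustrative;
    example_add_to_swap() is a made-up caller):

    static int example_add_to_swap(struct page *page)
    {
            swp_entry_t entry;

            /*
             * Before this patch a caller had to pair the two calls itself:
             *
             *      entry = get_swap_page(page);
             *      if (!entry.val)
             *              return 0;
             *      if (mem_cgroup_try_charge_swap(page, entry))
             *              goto free_slot;
             *
             * Now charging (and its failure accounting) happens inside
             * get_swap_page(), even when slot allocation fails.
             */
            entry = get_swap_page(page);
            if (!entry.val)
                    return 0;

            return add_to_swap_cache(page, entry, GFP_ATOMIC) == 0;
    }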

    Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
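
    A small illustration of the rename and the new lock helpers (a sketch;
    example_delete() is a made-up caller):

    static void example_delete(struct address_space *mapping, pgoff_t index)
    {
            xa_lock_irq(&mapping->i_pages);   /* was: spin_lock_irq(&mapping->tree_lock) */
            radix_tree_delete(&mapping->i_pages, index);  /* was: &mapping->page_tree */
            xa_unlock_irq(&mapping->i_pages);
    }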

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Apr, 2018

3 commits

  • The bool enable_vma_readahead and swap_vma_readahead() are local to the
    source and do not need to be in global scope, so make them static.

    Cleans up sparse warnings:

    mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
    mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • This patch removes the need for do_swap_page() to be aware of two
    different swap readahead algorithms, by unifying the cluster-based and
    VMA-based readahead function calls.
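
    Roughly the shape of the unified entry point (a sketch based on the
    description above; details simplified):

    struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                  struct vm_fault *vmf)
    {
            return swap_use_vma_readahead() ?
                    swap_vma_readahead(entry, gfp_mask, vmf) :
                    swap_cluster_readahead(entry, gfp_mask, vmf);
    }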

    Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Looking at the recent changes to swap readahead, I am very unhappy about
    the current code structure, which diverges into two swap readahead
    algorithms in do_swap_page. This patch cleans it up.

    The main motivation is that the fault handler doesn't need to be aware
    of the readahead algorithms; it should just call swapin_readahead.

    As a first step, this patch cleans up a little bit but is not perfect (I
    just separated it out to make review easier), so the next patch will
    complete the goal.

    [minchan@kernel.org: do not check readahead flag with THP anon]
    Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

16 Nov, 2017

3 commits

  • All callers of release_pages claim the pages being released are cache
    hot. As no one cares about the hotness of pages being released to the
    allocator, just ditch the parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
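
    For reference, the signature change this amounts to (illustrative):

    /* before */
    void release_pages(struct page **pages, int nr, bool cold);
    /* after: the useless hotness hint is gone */
    void release_pages(struct page **pages, int nr);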

    Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • These global variables are only set during initialization or rarely
    change, so declare them as __read_mostly.

    Link: http://lkml.kernel.org/r/1507802349-5554-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Changbin Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changbin Du
     
  • When a page fault occurs for a swap entry, the physical swap readahead
    (not the VMA-based swap readahead) may read ahead several swap entries
    after the faulting swap entry. The readahead algorithm calculates some
    of the swap entries to read ahead by increasing the offset of the
    faulting swap entry, without checking whether they are beyond the end of
    the swap device, and it relies on __swp_swapcount() and
    swapcache_prepare() to check that. Although __swp_swapcount() checks the
    swap entry passed in, it will complain with the error message below for
    the expected invalid swap entry. This may confuse end users.

    swap_info_get: Bad swap offset entry 0200f8a7

    To fix the false error message, a swap entry check is added in
    swapin_readahead() to avoid passing out-of-bound swap entries and the
    swap entry reserved for the swap header to __swp_swapcount() and
    swapcache_prepare().
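
    A sketch of the added check (illustrative; exact placement and helper
    names differ in the real patch):

    static bool example_swap_entry_in_range(struct swap_info_struct *si,
                                            swp_entry_t entry)
    {
            unsigned long offset = swp_offset(entry);

            /*
             * Skip offset 0 (reserved for the swap header) and anything at
             * or beyond the end of the swap device.
             */
            return offset != 0 && offset < si->max;
    }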

    Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
    Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
    Signed-off-by: "Huang, Ying"
    Reported-by: Christian Kujau
    Acked-by: Minchan Kim
    Suggested-by: Minchan Kim
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: [4.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
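
    For reference, the identifier added to such files is a single comment
    line at the top of the file, for example:

    // SPDX-License-Identifier: GPL-2.0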

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Oct, 2017

1 commit

  • When the VMA based swap readahead was introduced, a new knob

    /sys/kernel/mm/swap/vma_ra_max_order

    was added as the max window of VMA swap readahead. This is to make it
    possible to use different max window for VMA based readahead and
    original physical readahead. But Minchan Kim pointed out that this will
    cause a regression because setting page-cluster sysctl to zero cannot
    disable swap readahead with the change.

    To fix the regression, the page-cluster sysctl is used as the max window
    of both the VMA based swap readahead and original physical swap
    readahead. If more fine grained control is needed in the future, more
    knobs can be added as the subordinate knobs of the page-cluster sysctl.

    The vma_ra_max_order knob is deleted. Because the knob was introduced
    in v4.14-rc1 and this patch is targeted for merging before the v4.14
    release, there should be no existing users of this newly added ABI.
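
    A sketch of how the shared limit can be derived (illustrative; the
    names follow mm/swap_state.c but the code is simplified):

    static unsigned int example_max_ra_window(void)
    {
            /* page-cluster == 0 gives a window of 1, i.e. readahead disabled */
            return 1 << min_t(unsigned int, READ_ONCE(page_cluster),
                              SWAP_RA_ORDER_CEILING);
    }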

    Link: http://lkml.kernel.org/r/20171011070847.16003-1-ying.huang@intel.com
    Fixes: ec560175c0b6fce ("mm, swap: VMA based swap readahead")
    Signed-off-by: "Huang, Ying"
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

04 Oct, 2017

1 commit

  • MADV_FREE clears the pte dirty bit and then marks the page lazyfree
    (clears SwapBacked). There is no lock to prevent the page from being
    added to the swap cache by page reclaim between these two steps. If page
    reclaim finds such a page, it will simply add the page to the swap cache
    without paging it out to swap, because the page is marked as clean.
    Next time, a page fault will read data from the swap slot, which doesn't
    have the original data, so we have data corruption. To fix the issue, we
    mark the page dirty and page it out.

    However, we shouldn't dirty all pages which are clean and in the swap
    cache: a swapped-in page is in the swap cache and clean too. So we only
    dirty pages which are added into the swap cache by page reclaim, which
    shouldn't be swapped-in pages. As Minchan suggested, simply dirtying the
    page in add_to_swap can do the job.
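
    A simplified sketch of add_to_swap() after the fix (illustrative, error
    handling trimmed):

    int add_to_swap(struct page *page)
    {
            swp_entry_t entry = get_swap_page(page);

            if (!entry.val)
                    return 0;
            if (add_to_swap_cache(page, entry,
                                  __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
                    put_swap_page(page, entry);
                    return 0;
            }
            /*
             * The fix: a clean lazyfree (MADV_FREE) page added here by
             * reclaim must be dirtied, so it is written to swap instead of
             * being dropped and later faulted back in with stale data.
             */
            set_page_dirty(page);
            return 1;
    }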

    Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
    Link: http://lkml.kernel.org/r/08c84256b007bf3f63c91d94383bd9eb6fee2daa.1506446061.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Reported-by: Artem Savkov
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

07 Sep, 2017

4 commits

  • The sysfs interface to control the VMA based swap readahead is added as
    follow,

    /sys/kernel/mm/swap/vma_ra_enabled

    Enable the VMA based swap readahead algorithm, or use the original
    global swap readahead algorithm.

    /sys/kernel/mm/swap/vma_ra_max_order

    Set the max order of the readahead window size for the VMA based swap
    readahead algorithm.

    The corresponding ABI documentation is added too.

    Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The swap readahead is an important mechanism to reduce the swap in
    latency. Although pure sequential memory access pattern isn't very
    popular for anonymous memory, the space locality is still considered
    valid.

    In the original swap readahead implementation, the consecutive blocks in
    the swap device are read ahead based on a global space locality
    estimation. But consecutive blocks in the swap device just reflect the
    order of page reclaim; they don't necessarily reflect the access pattern
    in virtual memory. And different tasks in the system may have different
    access patterns, which makes the global space locality estimation
    incorrect.

    In this patch, when a page fault occurs, the virtual pages near the
    fault address will be read ahead instead of the swap slots near the
    faulting swap slot in the swap device. This avoids reading ahead
    unrelated swap slots. At the same time, the swap readahead is changed to
    work per-VMA instead of globally, so that the different access patterns
    of different VMAs can be distinguished and different readahead policies
    can be applied accordingly. The original core readahead detection and
    scaling algorithm is reused, because it is an effective algorithm for
    detecting space locality.

    The test and result is as follow,

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
    NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case, 4 processes to eat 50G
    virtual memory space, repeat the sequential memory writing until 300
    seconds. The first round writing will trigger swap out, the following
    rounds will trigger sequential swap in and out.

    At the same time, run vm-scalability random swap test case in
    background, 8 processes to eat 30G virtual memory space, repeat the
    random memory write until 300 seconds. This will trigger random swap-in
    in the background.

    This is a combined workload with sequential and random memory accessing
    at the same time. The result (for sequential workload) is as follow,

                        Base            Optimized
                        ----            ---------
    throughput          345413 KB/s     414029 KB/s (+19.9%)
    latency.average     97.14 us        61.06 us (-37.1%)
    latency.50th        2 us            1 us
    latency.60th        2 us            1 us
    latency.70th        98 us           2 us
    latency.80th        160 us          2 us
    latency.90th        260 us          217 us
    latency.95th        346 us          369 us
    latency.99th        1.34 ms         1.09 ms
    ra_hit%             52.69%          99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                        Base            Optimized
                        ----            ---------
    elapsed_time        393.49 s        329.88 s (-16.2%)
    ra_hit%             86.21%          98.82%

    The scores of the base and optimized kernels show no visible change.
    But the elapsed time is reduced and the readahead hit rate improved, so
    the optimized kernel runs better in the startup and teardown stages.
    And the absolute value of the readahead hit rate is high, which shows
    that space locality is still valid in some practical workloads.

    Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • In the original implementation, it is possible that existing pages in
    the swap cache (not newly read ahead) could be marked as readahead
    pages. This will cause the swap readahead statistics to be wrong and
    influence the swap readahead algorithm too.

    This is fixed by marking a page as a readahead page only if it is newly
    allocated and read from the disk.

    When testing with Linpack, after the fix the swap readahead hit rate
    increased from ~66% to ~86%.

    Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "mm, swap: VMA based swap readahead", v4.

    The swap readahead is an important mechanism to reduce the swap in
    latency. Although pure sequential memory access pattern isn't very
    popular for anonymous memory, the space locality is still considered
    valid.

    In the original swap readahead implementation, the consecutive blocks in
    the swap device are read ahead based on a global space locality
    estimation. But consecutive blocks in the swap device just reflect the
    order of page reclaim; they don't necessarily reflect the access pattern
    in virtual memory space. And different tasks in the system may have
    different access patterns, which makes the global space locality
    estimation incorrect.

    In this patchset, when a page fault occurs, the virtual pages near the
    fault address will be read ahead instead of the swap slots near the
    faulting swap slot in the swap device. This avoids reading ahead
    unrelated swap slots. At the same time, the swap readahead is changed to
    work per-VMA instead of globally, so that the different access patterns
    of different VMAs can be distinguished and different readahead policies
    can be applied accordingly. The original core readahead detection and
    scaling algorithm is reused, because it is an effective algorithm for
    detecting space locality.

    In addition to the swap readahead changes, some new sysfs interface is
    added to show the efficiency of the readahead algorithm and some other
    swap statistics.

    This new implementation will incur more small random reads. On SSD, the
    improved correctness of the estimation and readahead target should
    outweigh the potential increase in overhead; this is also illustrated in
    the test results below. But on HDD, the overhead may outweigh the
    benefit, so the original implementation will be used by default.

    The test and result is as follow,

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case, 4 processes to eat 50G
    virtual memory space, repeat the sequential memory writing until 300
    seconds. The first round writing will trigger swap out, the following
    rounds will trigger sequential swap in and out.

    At the same time, run vm-scalability random swap test case in
    background, 8 processes to eat 30G virtual memory space, repeat the
    random memory write until 300 seconds. This will trigger random swap-in
    in the background.

    This is a combined workload with sequential and random memory accessing
    at the same time. The result (for sequential workload) is as follow,

                        Base            Optimized
                        ----            ---------
    throughput          345413 KB/s     414029 KB/s (+19.9%)
    latency.average     97.14 us        61.06 us (-37.1%)
    latency.50th        2 us            1 us
    latency.60th        2 us            1 us
    latency.70th        98 us           2 us
    latency.80th        160 us          2 us
    latency.90th        260 us          217 us
    latency.95th        346 us          369 us
    latency.99th        1.34 ms         1.09 ms
    ra_hit%             52.69%          99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                        Base            Optimized
                        ----            ---------
    elapsed_time        393.49 s        329.88 s (-16.2%)
    ra_hit%             86.21%          98.82%

    The scores of the base and optimized kernels show no visible change.
    But the elapsed time is reduced and the readahead hit rate improved, so
    the optimized kernel runs better in the startup and teardown stages.
    And the absolute value of the readahead hit rate is high, which shows
    that space locality is still valid in some practical workloads.

    This patch (of 5):

    The statistics for total readahead pages and total readahead hits are
    recorded and exported via the following sysfs interface.

    /sys/kernel/mm/swap/ra_hits
    /sys/kernel/mm/swap/ra_total

    With them, the efficiency of the swap readahead could be measured, so
    that the swap readahead algorithm and parameters could be tuned
    accordingly.

    [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
    Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

11 Jul, 2017

1 commit

  • For fast flash disk, async IO could introduce overhead because of
    context switch. block-mq now supports IO poll, which improves
    performance and latency a lot. swapin is a good place to use this
    technique, because the task is waiting for the swapin page to continue
    execution.

    In my virtual machine, directly reading 4k of data from an NVMe device
    with iopoll is about 60% better than without polling. With iopoll
    support in the swapin path, my microbenchmark (a task doing random
    memory writes) is about 10%~25% faster. CPU utilization increases a lot
    though, 2x and even 3x. This will depend on disk speed.

    While iopoll in swapin isn't intended for all use cases, it's a win for
    latency sensitive workloads with a high speed swap disk. The block layer
    has a knob to control polling at runtime. If polling isn't enabled in
    the block layer, there should be no noticeable change in swapin.

    I got a chance to run the same test on an NVMe device with DRAM as the
    media. In a simple fio IO test, blkpoll boosts performance by 50% in the
    single thread test and ~20% in the 8 threads test. So that is the
    baseline. In the swap test above, blkpoll boosts performance by ~27% in
    the single thread test. blkpoll uses 2x CPU time though.

    If we enable hybrid polling, the performance gain drops very slightly,
    but the CPU time is only 50% worse than without blkpoll. We can also
    adjust the hybrid poll parameters; with that, the CPU time penalty is
    reduced further. In the 8 threads test, blkpoll doesn't help though. The
    performance is similar to that without blkpoll, and the CPU utilization
    is similar too. There is lock contention in the swap path, and the CPU
    time spent on blkpoll isn't high. So overall, blkpoll swapin isn't worse
    than swapin without it.

    Swapin readahead might read several pages in at the same time and form a
    big IO request. Since that IO will take a longer time, it doesn't make
    sense to poll, so the patch only does iopoll for single page swapin.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
    Signed-off-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

07 Jul, 2017

3 commits

  • add_to_swap() aims to allocate swap space (i.e., a swap slot and swap
    cache), so if it fails due to lack of space, in the case of THP or the
    like (HDD swap but trying THP swapout), the *caller*, rather than
    add_to_swap itself, should split the THP page and retry with the base
    page, which is more natural.
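
    The caller-side retry then looks roughly like this (a sketch of the
    shrink_page_list() logic, simplified):

    if (!add_to_swap(page)) {
            if (!PageTransHuge(page))
                    goto activate_locked;
            /* split the THP and retry with the base page */
            if (split_huge_page_to_list(page, page_list))
                    goto activate_locked;
            if (!add_to_swap(page))
                    goto activate_locked;
    }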

    Link: http://lkml.kernel.org/r/20170515112522.32457-4-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now get_swap_page takes a struct page and allocates swap space according
    to the page size (i.e., normal or THP), so it would be cleaner to
    introduce put_swap_page, which is the counterpart of get_swap_page. It
    then calls the right swap slot free function depending on the page's
    size.
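
    A sketch of the counterpart function described above (illustrative;
    close to the shape of the patch but simplified):

    void put_swap_page(struct page *page, swp_entry_t entry)
    {
            if (PageTransHuge(page))
                    swapcache_free_cluster(entry);
            else
                    swapcache_free(entry);
    }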

    [ying.huang@intel.com: minor cleanup and fix]
    Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices improved so fast that we
    cannot saturate the disk bandwidth with a single logical CPU when doing
    page swap out, even on a high-end server machine, because the
    performance of the storage device improved faster than that of a single
    logical CPU. And it seems that this trend will not change in the near
    future. On the other hand, THP becomes more and more popular because of
    increased memory sizes. So it becomes necessary to optimize THP swap
    performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for the swap read, which are usually 4k random
    IO. This will improve the performance of the THP swap too.

    - It will help the memory fragmentation, especially when the THP is
    heavily used by the applications. The 2M continuous pages will be
    free up after THP swapping out.

    - It will improve the THP utilization on the system with the swap
    turned on. Because the speed for khugepaged to collapse the normal
    pages into the THP is quite slow. After the THP is split during the
    swapping out, it will take quite long time for the normal pages to
    collapse back into the THP after being swapped in. The high THP
    utilization helps the efficiency of the page based memory management
    too.

    There are some concerns regarding THP swap in, mainly because possible
    enlarged read/write IO size (for swap in/out) may put more overhead on
    the storage device. To deal with that, the THP swap in should be turned
    on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

    This patchset is the first step for the THP swap support. The plan is
    to delay splitting THP step by step, finally avoid splitting THP during
    the THP swapping out and swap out/in the THP as a whole.

    As the first step, in this patchset, the splitting huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache. This will
    reduce lock acquiring/releasing for the locks used for the swap cache
    management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, splitting huge page is delayed from almost the first step
    of swapping out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding the THP into the swap cache. This
    will batch the corresponding operation, thus improve THP swap out
    throughput.

    This is the first step for the THP swap optimization. The plan is to
    delay splitting the THP step by step and avoid splitting the THP
    finally.

    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out. So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on x86_64 architecture (512). For other
    architectures which want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
    the architecture. In effect, this will enlarge the swap cluster size by
    2 times on x86_64, which may make it harder to find a free cluster when
    the swap space becomes fragmented. So this may reduce continuous swap
    space allocation and sequential writes in theory. The performance test
    in 0day shows no regressions caused by this.

    In the future of THP swap optimization, some information of the swapped
    out THP (such as compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support charge
    or uncharge a swap cluster backing a THP as a whole.

    The swap cluster allocate/free functions are added to allocate/free a
    swap cluster for a THP. A fairly simple algorithm is used for swap
    cluster allocation, that is, only the first swap device in the priority
    list is tried when allocating the swap cluster. The function will fail
    if that attempt is not successful, and the caller will fall back to
    allocating a single swap slot instead. This works well enough for
    normal cases. If the difference in the number of free swap clusters
    among multiple swap devices is significant, it is possible that some
    THPs are split earlier than necessary. For example, this could be
    caused by a big size difference among multiple swap devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be
    enhanced in the future with a multi-order radix tree. But because we
    will split the THP soon during swapping out, that optimization doesn't
    make much sense for this first step.

    The THP splitting functions are enhanced to support to split THP in swap
    cache during swapping out. The page lock will be held during allocating
    the swap cluster, adding the THP into the swap cache and splitting the
    THP. So in the code path other than swapping out, if the THP need to be
    split, the PageSwapCache(THP) will be always false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
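
    As a rough illustration of the cluster-first allocation with
    single-slot fallback described above, here is a minimal sketch; the
    helpers alloc_swap_cluster() and alloc_swap_slot() are hypothetical
    names used for illustration, not the actual kernel API:

        #include <linux/mm_types.h>     /* swp_entry_t, struct page */
        #include <linux/huge_mm.h>      /* PageTransHuge(), HPAGE_PMD_NR */

        /*
         * Sketch only: try to grab a whole swap cluster for a THP from
         * the first (highest priority) swap device; on failure the
         * caller splits the THP and falls back to a single swap slot.
         */
        swp_entry_t alloc_swap_space_for_page(struct page *page)
        {
                swp_entry_t entry = { .val = 0 };

                if (PageTransHuge(page)) {
                        /* only the first device in the priority list is tried */
                        entry = alloc_swap_cluster(HPAGE_PMD_NR);
                        /* entry.val == 0 means: split the THP and retry below */
                        return entry;
                }

                /* base page, or a sub-page after the THP has been split */
                return alloc_swap_slot();
        }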
     

09 May, 2017

1 commit

  • The swap code currently uses vzalloc() to allocate various data
    structures, such as the swap cache, swap slots cache, cluster info,
    etc., because the sizes may be too large on some systems for a
    normal kzalloc() to succeed. But kzalloc() has some advantages, for
    example less memory fragmentation and less TLB pressure. So change
    the data structure allocation in the swap code to use kvzalloc(),
    which tries kzalloc() first and falls back to vzalloc() if kzalloc()
    fails (see the sketch after this entry).

    In general, although kmalloc() reduces the number of free high-order
    pages in the short term, vmalloc() causes more memory fragmentation
    pain in the long term, and the swap data structure allocations
    changed in this patch are expected to be long-term allocations.

    From Dave Hansen:
    "for example, we have a two-page data structure. vmalloc() takes two
    effectively random order-0 pages, probably from two different 2M pages
    and pins them. That "kills" two 2M pages. kmalloc(), allocating two
    *contiguous* pages, will not cross a 2M boundary. That means it will
    only "kill" the possibility of a single 2M page. More 2M pages == less
    fragmentation."

    The allocation in this patch occurs at swapon time, which is usually
    during system boot, so there is usually a good chance of allocating
    the contiguous pages successfully.

    The allocation of swap_map[] in struct swap_info_struct is not
    changed, because it is usually quite large and vmalloc_to_page() is
    used with it, which makes it a little harder to change.

    Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
    Signed-off-by: Huang Ying
    Acked-by: Tim Chen
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
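
    A minimal sketch of the kzalloc()-first, vzalloc()-fallback
    behaviour provided by kvzalloc(); the helper names below are
    illustrative, and the array being allocated merely stands in for
    the kind of swap data structure the patch converts:

        #include <linux/mm.h>           /* kvzalloc(), kvfree() */
        #include <linux/slab.h>         /* GFP_KERNEL */

        static void *alloc_swap_array(size_t nr, size_t elem_size)
        {
                /*
                 * kvzalloc() first tries a physically contiguous,
                 * zeroed allocation (like kzalloc()); if that fails,
                 * it falls back to vzalloc().
                 */
                return kvzalloc(nr * elem_size, GFP_KERNEL);
        }

        static void free_swap_array(void *p)
        {
                /* kvfree() handles both kmalloc'd and vmalloc'd memory */
                kvfree(p);
        }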
     

04 May, 2017

1 commit

  • Commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
    deadlock while waiting on discard I/O completion") fixed a deadlock
    in read_swap_cache_async(). At that time, the swap allocation path
    could mark a swap entry as SWAP_HAS_CACHE and then wait for
    discarding to complete before the page for the swap entry was added
    to the swap cache.

    But since commit 815c2c543d3a ("swap: make swap discard async") made
    swap discard asynchronous, the wait for discard completion happens
    before the swap entry is set as SWAP_HAS_CACHE. The comments in the
    code are therefore out of date; this patch fixes them.

    The cond_resched() added in commit cbab0e4eec29 is no longer
    necessary either, but it is kept: if some sleep were added to the
    swap allocation path in the future, removing it could reintroduce a
    deadlock bug that would be hard to debug or reproduce.

    Link: http://lkml.kernel.org/r/20170317064635.12792-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rafael Aquini
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

23 Feb, 2017

4 commits

  • During swapoff, a swap entry may have swap_map[] == SWAP_HAS_CACHE
    (for example, when it has just been allocated). If
    __read_swap_cache_async() returned NULL in that case, swapoff would
    abort. So when the swap slot cache is disabled (as it is for
    swapoff), we wait in this race condition for the page to be put into
    the swap cache (see the sketch after this entry). This should not be
    a problem for the swap slot cache itself, because the cache is
    drained after swap_slot_cache_enabled is cleared.

    [ying.huang@intel.com: fix memory leak in __read_swap_cache_async()]
    Link: http://lkml.kernel.org/r/874lzt6znd.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/5e2c5f6abe8e6eb0797408897b1bba80938e9b9d.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Tim Chen
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
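
    A condensed sketch of the waiting behaviour described above. This is
    not the verbatim __read_swap_cache_async() code; the helper
    swap_entry_has_cache() is a hypothetical stand-in for the real
    swap_map check:

        #include <linux/swap.h>         /* lookup_swap_cache() */
        #include <linux/sched.h>        /* cond_resched() */

        /*
         * Sketch only: during swapoff, instead of returning NULL when a
         * racing task holds SWAP_HAS_CACHE, wait until that task adds
         * the page to the swap cache or the entry is freed.
         */
        static struct page *wait_for_racing_swapin(swp_entry_t entry)
        {
                struct page *page;

                for (;;) {
                        page = lookup_swap_cache(entry);
                        if (page)
                                return page;    /* the racing task finished */
                        if (!swap_entry_has_cache(entry))
                                return NULL;    /* entry was freed meanwhile */
                        cond_resched();         /* still SWAP_HAS_CACHE: retry */
                }
        }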
     
  • We add per-CPU caches for swap slots, so that slots can be allocated
    and freed quickly without touching the swap info lock.

    Two separate caches are maintained, one for allocated swap slots and
    one for returned swap slots. This allows returned slots to be handed
    back to the global pool in a batch, so they have a chance to be
    coalesced with other slots in a cluster. We do not reuse returned
    slots right away, as that may increase fragmentation of the slots.

    The swap allocation cache is protected by a mutex, as we may sleep
    while searching for empty slots to fill the cache. The swap free
    cache is protected by a spinlock, as we cannot sleep in the free
    path.

    We refill the swap slots cache when we run out of slots, and we
    disable the swap slots cache and drain its slots if the global
    number of free slots falls below a low watermark. We re-enable the
    cache again when the number of available slots rises above a high
    watermark (see the structure sketch after this entry).

    [ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access]
    [tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h]
    Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com
    Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Michal Hocko
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
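
    A sketch of the shape such a per-CPU cache could take, following the
    description above; the field names are illustrative rather than the
    exact swap_slots.h definitions:

        #include <linux/mutex.h>
        #include <linux/spinlock.h>
        #include <linux/mm_types.h>     /* swp_entry_t */

        struct swap_slots_cache {
                /* allocation side: refilling may sleep, so use a mutex */
                struct mutex    alloc_lock;
                swp_entry_t     *slots;         /* cached free slots */
                int             nr;             /* slots currently cached */
                int             cur;            /* next slot to hand out */

                /* free side: cannot sleep, so use a spinlock */
                spinlock_t      free_lock;
                swp_entry_t     *slots_ret;     /* slots queued for return */
                int             n_ret;          /* number queued for return */
        };

    Batching the returns through slots_ret is what gives freed slots a
    chance to be coalesced with neighbouring slots into whole clusters
    before they reach the global pool.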
     
  • We can avoid needlessly allocating pages for swap slots that are not
    used by anyone; no pages have to be read in for these slots.

    Link: http://lkml.kernel.org/r/0784b3f20b9bd3aa5552219624cb78dc4ae710c9.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • This patch improves the scalability of swap out/in by using
    fine-grained locks for the swap cache. In the current kernel, one
    address space is used for each swap device, and in common
    configurations the number of swap devices is very small (one is
    typical). This causes heavy lock contention on the radix tree of the
    address space when multiple tasks swap out/in concurrently.

    But in fact there is no dependency between pages in the swap cache,
    so the one shared address space per swap device can be split into
    several address spaces to reduce the lock contention. In this patch,
    the shared address space is split into 64MB trunks; 64MB is chosen
    to balance memory usage against the lock contention reduction (see
    the sketch after this entry).

    The size of struct address_space on the x86_64 architecture is 408B,
    so 1GB of swap space is covered by 16 such trunks and the patch uses
    16 * 408B = 6528B of additional memory per 1GB of swap space on
    x86_64.

    One address space is still shared by the swap entries in the same
    64MB trunk. To avoid lock contention during the first round of swap
    space allocation, the order of the swap clusters in the initial free
    clusters list is changed so that consecutive swap clusters in the
    free list are at least 64MB apart in swap space. After the first
    round of allocation, the swap clusters are expected to be freed
    randomly, so the lock contention should be reduced effectively.

    Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Tim Chen
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang, Ying
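
    A back-of-the-envelope sketch of the trunk lookup described above:
    64MB of swap at 4KB per page is 16384 = 2^14 entries per trunk, so
    the trunk index is simply the swap offset shifted right by 14, and
    1GB of swap needs 16 address spaces (16 * 408B = 6528B). The names
    below follow that description and should be read as illustrative:

        #include <linux/fs.h>           /* struct address_space */
        #include <linux/swapops.h>      /* swp_type(), swp_offset() */

        /* 64MB / 4KB page = 2^14 swap entries per address space trunk */
        #define SWAP_ADDRESS_SPACE_SHIFT        14
        #define SWAP_ADDRESS_SPACE_PAGES        (1 << SWAP_ADDRESS_SPACE_SHIFT)

        /* one array of address spaces per swap device ("type") */
        extern struct address_space *swapper_spaces[];

        static inline struct address_space *swap_address_space(swp_entry_t entry)
        {
                return &swapper_spaces[swp_type(entry)]
                        [swp_offset(entry) >> SWAP_ADDRESS_SPACE_SHIFT];
        }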
     

08 Oct, 2016

1 commit

  • This patch improves the performance of swap cache operations when
    the type of the swap device is not 0. Originally, the whole swap
    entry value was used as the key of the swap cache, even though there
    is one radix tree per swap device. If the type of the swap device is
    not 0, the height of the radix tree of the swap cache is increased
    unnecessarily, especially on 64-bit architectures. For example, for
    a 1GB swap device on the x86_64 architecture, the height of the
    radix tree of the swap cache is 11; if the offset of the swap entry
    is used as the key instead, the height is only 4. The increased
    height causes unnecessary radix tree descents and an increased cache
    footprint.

    This patch reduces the height of the radix tree of the swap cache by
    using the offset of the swap entry, instead of the whole swap entry
    value, as the key of the swap cache (see the sketch after this
    entry). In a 32-process sequential swap-out test case on a Xeon E5
    v3 system with a RAM disk as swap, the lock contention on the swap
    cache spinlock is reduced from 20.15% to 12.19% when the type of the
    swap device is 1.

    Use the whole swap entry as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

    Use the swap offset as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,

    Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
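
    A minimal sketch of the key change described above; at the time of
    this patch the swap cache radix tree lived in address_space.page_tree,
    and the add_to_swap_cache_sketch() wrapper is illustrative rather
    than the actual kernel diff:

        #include <linux/fs.h>           /* struct address_space */
        #include <linux/radix-tree.h>   /* radix_tree_insert() */
        #include <linux/swapops.h>      /* swp_offset() */

        /*
         * swp_entry_t packs the swap device type into the high bits of
         * .val, so using .val as the radix tree key makes the keys huge
         * whenever type != 0; using only the offset keeps the keys
         * dense and the tree shallow.
         */
        static int add_to_swap_cache_sketch(struct address_space *mapping,
                                            struct page *page,
                                            swp_entry_t entry)
        {
                /* old key: entry.val; new key: just the swap offset */
                return radix_tree_insert(&mapping->page_tree,
                                         swp_offset(entry), page);
        }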