20 Aug, 2019

2 commits


13 Jul, 2019

2 commits

  • swap_extent is used to map swap page offset to backing device's block
    offset. For a continuous block range, one swap_extent is used and all
    these swap_extents are managed in a linked list.

    These swap_extents are used by map_swap_entry() during swap's read and
    write paths. To find the backing device's block offset for a page
    offset, the swap_extent list is traversed linearly, with
    curr_swap_extent used as a cache to speed up the search.

    This works well as long as there are not many swap_extents or only a
    few processes access the swap device, but when the swap device has many
    extents and a number of processes access it concurrently, it can be a
    problem. On one of our servers, the disk's remaining size is tight:

    $df -h
    Filesystem Size Used Avail Use% Mounted on
    ... ...
    /dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4

    When creating an 80G swapfile there, there are as many as 84656 swap
    extents. The end result is that the kernel spends about 30% of its time
    in map_swap_entry() and swap throughput is only 70MB/s.

    As a comparison, when I used a smaller swapfile, like 4G, whose
    swap_extent count dropped to 2000, swap throughput is back to
    400-500MB/s and map_swap_entry() is about 3%.

    One downside of using an rb-tree for swap_extent is that 'struct
    rb_node' takes 24 bytes while 'struct list_head' takes 16 bytes, that's
    8 bytes more for each swap_extent. For a swapfile that has 80k
    swap_extents, that means 625KiB more memory consumed.

    Test:

    Since it's not possible to reboot that server, I cannot test this patch
    directly there. Instead, I tested it on another server with an NVMe disk.

    I created a 20G swapfile on an NVMe backed XFS fs. By default, the
    filesystem is quite clean and the created swapfile has only 2 extents.
    Testing vanilla and this patch shows no obvious performance difference
    when swapfile is not fragmented.

    To see the patch's effect, I used some tweaks to manually fragment the
    swapfile by breaking the extents at 1M boundaries. This made the
    swapfile have 20K extents.

    nr_task=4
    kernel      swapout(KB/s)     map_swap_entry(perf)    swapin(KB/s)      map_swap_entry(perf)
    vanilla     165191            90.77%                  171798            90.21%
    patched     858993 +420%      2.16%                   715827 +317%      0.77%

    nr_task=8
    kernel      swapout(KB/s)     map_swap_entry(perf)    swapin(KB/s)      map_swap_entry(perf)
    vanilla     306783            92.19%                  318145            87.76%
    patched     954437 +211%      2.35%                   1073741 +237%     1.57%

    swapout: the throughput of swap out, in KB/s, higher is better
    1st map_swap_entry: cpu cycles percent sampled by perf
    swapin: the throughput of swap in, in KB/s, higher is better
    2nd map_swap_entry: cpu cycles percent sampled by perf

    nr_task=1 doesn't show any difference; this is because curr_swap_extent
    can be used effectively to cache the correct swap extent for a
    single-task workload.

    [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
    Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
    Signed-off-by: Aaron Lu
    Cc: Huang Ying
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
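
    As a rough illustration of the lookup-cost difference described in the
    entry above, here is a minimal userspace sketch (not the kernel patch):
    it maps a page offset through many extents once by linear scan and once
    by binary search, with a sorted array standing in for the kernel's
    rb-tree. The struct and names are invented for the demo.

    #include <stdio.h>
    #include <stdlib.h>

    struct ext {                            /* invented stand-in for swap_extent */
            unsigned long start_page;       /* first page offset covered */
            unsigned long nr_pages;         /* length of the extent */
            unsigned long start_block;      /* backing block of start_page */
    };

    static unsigned long lookup_linear(struct ext *v, size_t n, unsigned long pg)
    {
            for (size_t i = 0; i < n; i++)
                    if (pg >= v[i].start_page && pg < v[i].start_page + v[i].nr_pages)
                            return v[i].start_block + (pg - v[i].start_page);
            return 0;
    }

    static int cmp(const void *key, const void *elem)
    {
            unsigned long pg = *(const unsigned long *)key;
            const struct ext *e = elem;

            if (pg < e->start_page)
                    return -1;
            if (pg >= e->start_page + e->nr_pages)
                    return 1;
            return 0;
    }

    static unsigned long lookup_log(struct ext *v, size_t n, unsigned long pg)
    {
            struct ext *e = bsearch(&pg, v, n, sizeof(*v), cmp);

            return e ? e->start_block + (pg - e->start_page) : 0;
    }

    int main(void)
    {
            size_t n = 84656;               /* extent count from the commit message */
            struct ext *v = calloc(n, sizeof(*v));

            for (size_t i = 0; i < n; i++) {
                    v[i].start_page = i * 256;              /* 1M extents of 4K pages */
                    v[i].nr_pages = 256;
                    v[i].start_block = i * 256 + 1000000;   /* arbitrary block base */
            }
            /* both return the same block; only the search cost differs */
            printf("linear: %lu  log: %lu\n",
                   lookup_linear(v, n, 84655UL * 256 + 3),
                   lookup_log(v, n, 84655UL * 256 + 3));
            free(v);
            return 0;
    }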
     
    When swapin is performed, after getting the swap entry information from
    the page table, the system will swap in the swap entry without any lock
    held to prevent the swap device from being swapped off. This may cause
    a race like the one below:

    CPU 1                                   CPU 2
    -----                                   -----
                                            do_swap_page
                                              swapin_readahead
                                                __read_swap_cache_async
    swapoff                                       swapcache_prepare
      p->swap_map = NULL                            __swap_duplicate
                                                      p->swap_map[?] /* !!! NULL pointer access */

    Because swapoff is usually only done at system shutdown, the race may
    not hit many people in practice. But it is still a race that needs to
    be fixed.

    To fix the race, get_swap_device() is added to check whether the
    specified swap entry is valid on its swap device. If so, it will keep
    the swap entry valid by preventing the swap device from being swapped
    off, until put_swap_device() is called.

    Because swapoff() is a very rare code path, to make the normal path run
    as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are
    used instead of a reference count to implement get/put_swap_device().
    From get_swap_device() to put_swap_device(), the RCU read side is
    locked, so synchronize_rcu() in swapoff() will wait until
    put_swap_device() is called.

    In addition to the swap_map, cluster_info, etc. data structures in
    struct swap_info_struct, the swap cache radix tree will be freed after
    swapoff, so this patch fixes the race between swap cache lookup and
    swapoff too.

    Races between some other swap cache usages and swapoff are fixed too by
    calling synchronize_rcu() between clearing PageSwapCache() and freeing
    the swap cache data structure.

    Another possible method to fix this is to use preempt_disable() +
    stop_machine() to prevent the swap device from being swapped off while
    its data structures are being accessed. The overhead in the hot path of
    both methods is similar. The advantages of the RCU-based method are:

    1. stop_machine() may disturb the normal execution code path on other
       CPUs, which the RCU read-side primitives do not.

    2. File cache uses RCU to protect its radix tree. If a similar
       mechanism is used for the swap cache too, it is easier to share code
       between them.

    3. RCU is already used to protect the swap cache in
       total_swapcache_pages() and exit_swap_address_space(). The two
       mechanisms can be merged to simplify the logic.

    Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
    Fixes: 235b62176712 ("mm/swap: add cluster lock")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Not-nacked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Paul E. McKenney
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Tim Chen
    Cc: Mel Gorman
    Cc: Jérôme Glisse
    Cc: Yang Shi
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Jan Kara
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
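
    The get/put pattern described in the entry above can be sketched in
    userspace with liburcu (assuming liburcu is installed; this illustrates
    the RCU publish/retract idea, it is not the kernel implementation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <urcu.h>               /* userspace RCU; build with -lurcu */

    struct swap_dev {               /* stand-in for swap_info_struct */
            unsigned char *swap_map;
            long pages;
    };

    static struct swap_dev *the_dev;        /* stand-in for swap_info[type] */

    static struct swap_dev *get_swap_device(void)
    {
            struct swap_dev *dev;

            rcu_read_lock();
            dev = rcu_dereference(the_dev);
            if (!dev)                       /* already swapped off */
                    rcu_read_unlock();
            return dev;                     /* reader stays in the RCU section if non-NULL */
    }

    static void put_swap_device(struct swap_dev *dev)
    {
            if (dev)
                    rcu_read_unlock();
    }

    static void do_swapoff(void)
    {
            struct swap_dev *dev = the_dev;

            rcu_assign_pointer(the_dev, NULL);
            synchronize_rcu();              /* wait for every get/put critical section */
            free(dev->swap_map);            /* now safe: no reader can still see dev */
            free(dev);
    }

    int main(void)
    {
            rcu_register_thread();

            the_dev = calloc(1, sizeof(*the_dev));
            the_dev->pages = 1024;
            the_dev->swap_map = calloc(the_dev->pages, 1);

            struct swap_dev *dev = get_swap_device();
            if (dev) {
                    dev->swap_map[123] = 1; /* safe: do_swapoff() cannot free it yet */
                    put_swap_device(dev);
            }
            do_swapoff();

            rcu_unregister_thread();
            return 0;
    }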
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
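
    For reference, the resulting identifier as it appears at the top of a C
    source file (kernel C headers use the /* ... */ comment form instead,
    per the kernel's license-rules documentation):

    // SPDX-License-Identifier: GPL-2.0-only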
     

20 Apr, 2019

3 commits

  • The igrab() in shmem_unuse() looks good, but we forgot that it gives no
    protection against concurrent unmounting: a point made by Konstantin
    Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
    ("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
    swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
    Self-destruct in 5 seconds. Have a nice day..." followed by GPF.

    Once again, give up on using igrab(); but don't go back to making such
    heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
    the new design, and I expect could deadlock inside shmem_swapin_page().

    Instead, shmem_unuse() just raises a "stop_eviction" count in the shmem-
    specific inode, and shmem_evict_inode() waits for that to go down to 0.
    Call it "stop_eviction" rather than "swapoff_busy" because it can be put
    to use by others later (huge tmpfs patches expect to use it).

    That simplifies shmem_unuse(), protecting it from both unlink and
    unmount; and in practice lets it locate all the swap in its first try.
    But do not rely on that: there's still a theoretical case, when
    shmem_writepage() might have been preempted after its get_swap_page(),
    before making the swap entry visible to swapoff.

    [hughd@google.com: remove incorrect list_del()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904091133570.1898@eggly.anvils
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081259400.1523@eggly.anvils
    Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
    Signed-off-by: Hugh Dickins
    Cc: "Alex Xu (Hello71)"
    Cc: Huang Ying
    Cc: Kelley Nielsen
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Vineeth Pillai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
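
    The "raise a count, wait for it to drop to zero" idea described above
    can be sketched with a mutex and condition variable. This is purely
    illustrative: the kernel patch uses an atomic count and a wait on it,
    and the names below are invented for the demo.

    #include <pthread.h>
    #include <stdio.h>

    struct demo_inode {                     /* invented for the demo */
            pthread_mutex_t lock;
            pthread_cond_t cond;
            int stop_eviction;              /* >0: eviction must wait */
    };

    static void stop_eviction_get(struct demo_inode *inode)
    {
            pthread_mutex_lock(&inode->lock);
            inode->stop_eviction++;
            pthread_mutex_unlock(&inode->lock);
    }

    static void stop_eviction_put(struct demo_inode *inode)
    {
            pthread_mutex_lock(&inode->lock);
            if (--inode->stop_eviction == 0)
                    pthread_cond_broadcast(&inode->cond);
            pthread_mutex_unlock(&inode->lock);
    }

    static void evict_inode(struct demo_inode *inode)
    {
            pthread_mutex_lock(&inode->lock);
            while (inode->stop_eviction > 0)        /* wait until swapoff is done with it */
                    pthread_cond_wait(&inode->cond, &inode->lock);
            pthread_mutex_unlock(&inode->lock);
            printf("evicting now, no swapoff user left\n");
    }

    int main(void)
    {
            struct demo_inode inode = {
                    .lock = PTHREAD_MUTEX_INITIALIZER,
                    .cond = PTHREAD_COND_INITIALIZER,
            };

            stop_eviction_get(&inode);      /* what shmem_unuse() would do */
            stop_eviction_put(&inode);      /* ... and when it is done */
            evict_inode(&inode);            /* eviction waits for the count to hit 0 */
            return 0;
    }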
     
  • The old try_to_unuse() implementation was driven by find_next_to_unuse(),
    which terminated as soon as all the swap had been freed.

    Add inuse_pages checks now (alongside signal_pending()) to stop scanning
    mms and swap_map once finished.

    The same ought to be done in shmem_unuse() too, but never was before,
    and needs a different interface: so leave it as is for now.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081258200.1523@eggly.anvils
    Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
    Signed-off-by: Hugh Dickins
    Cc: "Alex Xu (Hello71)"
    Cc: Huang Ying
    Cc: Kelley Nielsen
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Vineeth Pillai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
    further testing has proved it to be a source of unnecessary swapoff
    EBUSY failures (which can then be followed by unmount EBUSY failures).

    When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting
    or inode being evicted, freeing up swap independent of try_to_unuse().
    Those typically completed much sooner than the old quadratic swapoff,
    but now it's more common that swapoff may need to wait for them.

    It's possible to move those cases from init_mm.mmlist and shmem_swaplist
    to separate "exiting" swaplists, and try_to_unuse() then wait for those
    lists to be emptied; but we've not bothered with that in the past, and
    don't want to risk missing some other forgotten case. So just revert to
    cycling around until the swap is gone, without any retries limit.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081256170.1523@eggly.anvils
    Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
    Signed-off-by: Hugh Dickins
    Cc: "Alex Xu (Hello71)"
    Cc: Huang Ying
    Cc: Kelley Nielsen
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Vineeth Pillai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Mar, 2019

4 commits

  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct foo {
        int stuff;
        struct boo entry[];
    };

    size = sizeof(struct foo) + count * sizeof(struct boo);
    instance = kvzalloc(size, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL);

    Notice that, in this case, variable size is not necessary, hence it is
    removed.

    This code was detected with the help of Coccinelle.

    Link: http://lkml.kernel.org/r/20190221154622.GA19599@embeddedor
    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
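
    A userspace sketch of what a struct_size()-style helper buys over the
    open-coded form above: the multiplication is overflow-checked and
    saturates, so the allocation fails instead of silently under-allocating.
    The helper here is hand-rolled for the demo; the kernel macro differs in
    detail.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct boo { int x; };

    struct foo {
            int stuff;
            struct boo entry[];             /* flexible array member */
    };

    /* hand-rolled stand-in: overflow-checked, saturating to SIZE_MAX */
    static size_t demo_struct_size(size_t base, size_t elem, size_t count)
    {
            size_t bytes;

            if (__builtin_mul_overflow(count, elem, &bytes) ||
                __builtin_add_overflow(bytes, base, &bytes))
                    return SIZE_MAX;        /* an allocation of SIZE_MAX will fail */
            return bytes;
    }

    int main(void)
    {
            size_t count = 16;
            struct foo *instance;

            instance = calloc(1, demo_struct_size(sizeof(struct foo),
                                                  sizeof(struct boo), count));
            if (!instance)
                    return 1;
            instance->entry[count - 1].x = 42;      /* last element is in bounds */
            printf("allocated %zu bytes\n",
                   demo_struct_size(sizeof(struct foo), sizeof(struct boo), count));

            /* a huge count no longer wraps around to a tiny allocation */
            printf("overflowing size saturates to %zu\n",
                   demo_struct_size(sizeof(struct foo), sizeof(struct boo),
                                    SIZE_MAX / 2));
            free(instance);
            return 0;
    }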
     
  • Number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function old new delta
    hv_synic_alloc.cold 88 110 +22
    prealloc_shrinker 260 262 +2
    bootstrap 249 251 +2
    sched_init_numa 1566 1567 +1
    show_slab_objects 778 777 -1
    s_show 1201 1200 -1
    kmem_cache_init 346 345 -1
    __alloc_workqueue_key 1146 1145 -1
    mem_cgroup_css_alloc 1614 1612 -2
    __do_sys_swapon 4702 4699 -3
    __list_lru_init 655 651 -4
    nic_probe 2379 2374 -5
    store_user_store 118 111 -7
    red_zone_store 106 99 -7
    poison_store 106 99 -7
    wq_numa_init 348 338 -10
    __kmem_cache_empty 75 65 -10
    task_numa_free 186 173 -13
    merge_across_nodes_store 351 336 -15
    irq_create_affinity_masks 1261 1246 -15
    do_numa_crng_init 343 321 -22
    task_numa_fault 4760 4737 -23
    swapfile_init 179 156 -23
    hv_synic_alloc 536 492 -44
    apply_wqattrs_prepare 746 695 -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Dan Carpenter reports a potential NULL dereference in
    get_swap_page_of_type:

    Smatch complains that the NULL checks on "si" aren't consistent. This
    seems like a real bug because we have not ensured that the type is
    valid and so "si" can be NULL.

    Add the missing check for NULL, taking care to use a read barrier to
    ensure CPU1 observes CPU0's updates in the correct order:

    CPU0                                    CPU1
    alloc_swap_info()                       if (type >= nr_swapfiles)
      swap_info[type] = p                       /* handle invalid entry */
      smp_wmb()                             smp_rmb()
      ++nr_swapfiles                        p = swap_info[type]

    Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
    CPU0's write to swap_info[type] and read NULL from swap_info[type].

    Ying Huang noticed other places in swapfile.c don't order these reads
    properly. Introduce swap_type_to_swap_info to encourage correct usage.

    Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
    (see tools/memory-model/Documentation/explanation.txt).

    This ordering need not be enforced in places where swap_lock is held
    (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
    and the swap_info array.

    Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
    Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
    Signed-off-by: Daniel Jordan
    Reported-by: Dan Carpenter
    Suggested-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Acked-by: Peter Zijlstra (Intel)
    Cc: Alan Stern
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Omar Sandoval
    Cc: Paul McKenney
    Cc: Shaohua Li
    Cc: Stephen Rothwell
    Cc: Tejun Heo
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jordan
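
    The publication ordering described in the entry above can be sketched
    with C11 atomics, using a release store on the count paired with an
    acquire load on the reader side as the analogue of the
    smp_wmb()/smp_rmb() pairing (stand-in names, not the kernel code):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_SWAPFILES 32

    struct swap_info { int prio; };

    static struct swap_info *_Atomic swap_info[MAX_SWAPFILES];
    static atomic_int nr_swapfiles;

    static void publish(int type, struct swap_info *p)
    {
            atomic_store_explicit(&swap_info[type], p, memory_order_relaxed);
            /* release: the pointer store above is visible before the new count */
            atomic_store_explicit(&nr_swapfiles, type + 1, memory_order_release);
    }

    static struct swap_info *type_to_info(int type)
    {
            /* acquire pairs with the release store of nr_swapfiles */
            if (type >= atomic_load_explicit(&nr_swapfiles, memory_order_acquire))
                    return NULL;            /* invalid type */
            return atomic_load_explicit(&swap_info[type], memory_order_relaxed);
    }

    static void *reader(void *arg)
    {
            struct swap_info *p = type_to_info(0);

            (void)arg;
            if (p)                          /* never NULL once the count covers type 0 */
                    printf("prio %d\n", p->prio);
            return NULL;
    }

    int main(void)
    {
            pthread_t t;
            struct swap_info *p = malloc(sizeof(*p));

            p->prio = -2;
            pthread_create(&t, NULL, reader, NULL);
            publish(0, p);
            pthread_join(t, NULL);
            free(p);
            return 0;
    }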
     
  • This patch was initially posted by Kelley Nielsen. Reposting the patch
    with all review comments addressed and with minor modifications and
    optimizations. Also, folding in the fixes offered by Hugh Dickins and
    Huang Ying. Tests were rerun and commit message updated with new
    results.

    try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
    It unuses swap entries one by one, potentially iterating over all the
    page tables for all the processes in the system for each one.

    This new proposed implementation of try_to_unuse simplifies its
    complexity to linear. It iterates over the system's mms once, unusing
    all the affected entries as it walks each set of page tables. It also
    makes similar changes to shmem_unuse.

    Improvement

    swapoff was called on a swap partition containing about 6G of data, in a
    VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.

    Present implementation....about 1200M calls(8min, avg 80% cpu util).
    Prototype.................about 9.0K calls(3min, avg 5% cpu util).

    Details

    In shmem_unuse(), iterate over the shmem_swaplist and, for each
    shmem_inode_info that contains a swap entry, pass it to
    shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
    iterate over its associated xarray, and store the index and value of
    each swap entry in an array for passing to shmem_swapin_page() outside
    of the RCU critical section.

    In try_to_unuse(), instead of iterating over the entries in the type and
    unusing them one by one, perhaps walking all the page tables for all the
    processes for each one, iterate over the mmlist, making one pass. Pass
    each mm to unuse_mm() to begin its page table walk, and during the walk,
    unuse all the ptes that have backing store in the swap type received by
    try_to_unuse(). After the walk, check the type for orphaned swap
    entries with find_next_to_unuse(), and remove them from the swap cache.
    If find_next_to_unuse() starts over at the beginning of the type, repeat
    the check of the shmem_swaplist and the walk a maximum of three times.

    Change unuse_mm() and the intervening walk functions down to
    unuse_pte_range() to take the type as a parameter, and to iterate over
    their entire range, calling the next function down on every iteration.
    In unuse_pte_range(), make a swap entry from each pte in the range using
    the passed in type. If it has backing store in the type, call
    swapin_readahead() to retrieve the page and pass it to unuse_pte().

    Pass the count of pages_to_unuse down the page table walks in
    try_to_unuse(), and return from the walk when the desired number of
    pages has been swapped back in.

    Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
    Signed-off-by: Vineeth Remanan Pillai
    Signed-off-by: Kelley Nielsen
    Signed-off-by: Huang Ying
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineeth Remanan Pillai
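
    A toy model of the complexity argument in the entry above, counting
    "pte visits" only; this illustrates quadratic versus linear scanning,
    it is not the kernel algorithm itself.

    #include <stdio.h>

    #define NR_MM   100                     /* processes */
    #define NR_PTE  1000                    /* ptes per process */

    int main(void)
    {
            long nr_entries = NR_MM * NR_PTE;       /* pretend every pte holds a swap entry */
            long quadratic = 0, linear = 0;

            /* old: for each swap entry, walk every process's page tables looking for it */
            for (long entry = 0; entry < nr_entries; entry++)
                    for (long mm = 0; mm < NR_MM; mm++)
                            quadratic += NR_PTE;

            /* new: one pass over each mm, unusing every matching entry as it is met */
            for (long mm = 0; mm < NR_MM; mm++)
                    linear += NR_PTE;

            printf("old pte visits: %ld\nnew pte visits: %ld\n", quadratic, linear);
            return 0;
    }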
     

29 Dec, 2018

2 commits

    KSM pages may be mapped to multiple VMAs that cannot be reached from
    one anon_vma. So during swapin, a new copy of the page needs to be
    generated if a different anon_vma is needed; please refer to the
    comments of ksm_might_need_to_copy() for details.

    During swapoff, unuse_vma() uses the anon_vma (if available) to locate
    the VMA and virtual address mapped to the page, so not all mappings of
    a swapped-out KSM page can be found. So in try_to_unuse(), even if the
    swap count of a swap entry isn't zero, the page needs to be deleted
    from the swap cache, so that in the next round a new page can be
    allocated and swapped in for the other mappings of the swapped-out KSM
    page.

    But this contradicts THP swap support, where the THP can be deleted
    from the swap cache only after the swap count of every swap entry in
    the huge swap cluster backing the THP has reached 0. So try_to_unuse()
    was changed in commit e07098294adf ("mm, THP, swap: support to reclaim
    swap space for THP swapped out") to check that before deleting a page
    from the swap cache, but this broke KSM swapoff too.

    Fortunately, KSM is for normal pages only, so the original behavior for
    KSM pages can be restored easily by checking PageTransCompound(). That
    is how this patch works.

    The bug was introduced by e07098294adf ("mm, THP, swap: support to
    reclaim swap space for THP swapped out"), which was merged in
    v4.14-rc1, so I think we should backport the fix to 4.14 onward. But
    Hugh thinks it may be rare for KSM pages to be in the swap device at
    swapoff time, which is why nobody has reported the bug so far.

    Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
    Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
    Signed-off-by: "Huang, Ying"
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Daniel Jordan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    Since a2468cc9bfdf ("swap: choose swap device according to numa node"),
    the avail_lists field of swap_info_struct has been an array with
    MAX_NUMNODES elements. This increased the size of swap_info_struct to
    40KiB, which needs an order-4 page to hold it.

    This is not optimal in that:
    1. Most systems have far fewer than MAX_NUMNODES (1024) nodes, so it
       is a waste of memory;
    2. It could cause swapon failure if the swap device is swapped on
       after the system has been running for a while, because no order-4
       page is available, as pointed out by Vasily Averin.

    Solve the above two issues by using nr_node_ids (which is the actual
    number of possible nodes on the running system) for avail_lists instead
    of MAX_NUMNODES.

    nr_node_ids is unknown at compile time, so it can't be used directly
    when declaring this array. What I did here is to declare avail_lists as
    a zero-element array and allocate space for it when allocating space
    for swap_info_struct. The reason for keeping it an array rather than a
    pointer is that plist_for_each_entry needs the field to be part of the
    struct, so a pointer will not work.

    This patch is on top of Vasily Averin's fix commit. I think the use of
    kvzalloc for swap_info_struct is still needed in case nr_node_ids is
    really big on some systems.

    Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
    Signed-off-by: Aaron Lu
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vasily Averin
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
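
    A userspace sketch of the flexible-array approach described above:
    declare the per-node array as a flexible array member and size the
    allocation by the runtime node count (stand-in types; the kernel's
    plist machinery is not reproduced here).

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_NUMNODES 1024

    struct list_node { struct list_node *next, *prev; };    /* plist stand-in */

    struct swap_info_demo {
            int prio;
            long pages;
            struct list_node avail_lists[];         /* sized at allocation time */
    };

    static struct swap_info_demo *alloc_swap_info(int nr_node_ids)
    {
            /* allocate only nr_node_ids entries, not MAX_NUMNODES */
            return calloc(1, sizeof(struct swap_info_demo) +
                             (size_t)nr_node_ids * sizeof(struct list_node));
    }

    int main(void)
    {
            int nr_node_ids = 2;            /* a typical 2-node box vs. MAX_NUMNODES */
            struct swap_info_demo *p = alloc_swap_info(nr_node_ids);

            printf("per-struct array bytes: %zu (would be %zu with MAX_NUMNODES)\n",
                   nr_node_ids * sizeof(struct list_node),
                   (size_t)MAX_NUMNODES * sizeof(struct list_node));
            free(p);
            return 0;
    }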
     

19 Nov, 2018

1 commit

    Commit a2468cc9bfdf ("swap: choose swap device according to numa node")
    changed the 'avail_lists' field of 'struct swap_info_struct' to an
    array. In popular Linux distros this increased the size of
    swap_info_struct up to 40 Kbytes, and now a swap_info_struct allocation
    requires an order-4 page. Switching to kvzalloc() allows us to avoid
    unexpected allocation failures.

    Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
    Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node")
    Signed-off-by: Vasily Averin
    Acked-by: Aaron Lu
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Huang Ying
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

27 Oct, 2018

5 commits

  • Btrfs currently does not support swap files because swap's use of bmap
    does not work with copy-on-write and multiple devices. See 35054394c4b3
    ("Btrfs: stop providing a bmap operation to avoid swapfile corruptions").

    However, the swap code has a mechanism for the filesystem to manually add
    swap extents using add_swap_extent() from the ->swap_activate() aop.
    iomap has done this since 67482129cdab ("iomap: add a swapfile activation
    function"). Btrfs will do the same in a later patch, so export
    add_swap_extent().

    Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Reviewed-by: Andrew Morton
    Cc: David Sterba
    Cc: Johannes Weiner
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
    through the filesystem, and to make swapoff() call ->swap_deactivate().
    For Btrfs, we want the latter but not the former, so split this flag into
    two. This makes us always call ->swap_deactivate() if ->swap_activate()
    succeeded, not just if it didn't add any swap extents itself.

    This also resolves the issue of the very misleading name of SWP_FILE,
    which is only used for swap files over NFS.

    Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: David Sterba
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
    si->swap_map[] of the swap entries in a cluster needs to be cleared
    during freeing. Previously, this was done in the caller of
    swap_free_cluster(). This may cause code duplication (one user now,
    more users will be added later) and locks/unlocks the cluster
    unnecessarily. In this patch, the clearing code is moved into
    swap_free_cluster() to avoid these downsides.

    Link: http://lkml.kernel.org/r/20180827075535.17406-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    This is a code cleanup patch without functional change.

    Originally, when __swap_entry_free() is called and its return value is
    0, free_swap_slot() will always be called to free the swap entry to the
    per-CPU pool. So move the call to free_swap_slot() into
    __swap_entry_free() to simplify the code.

    Link: http://lkml.kernel.org/r/20180827075535.17406-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    The code path to reclaim the swap entry in free_swap_and_cache() is
    almost the same as that of __try_to_reclaim_swap(); the largest
    difference is just coding style. So the support for the additional
    requirement of free_swap_and_cache() is added to
    __try_to_reclaim_swap(), free_swap_and_cache() is changed to call
    __try_to_reclaim_swap(), and the duplicated code is deleted. This will
    improve code readability and reduce the potential for bugs.

    There are 2 functional differences between __try_to_reclaim_swap()
    and the swap entry reclaim code of free_swap_and_cache():

    - free_swap_and_cache() only reclaims the swap entry if the page is
      unmapped or swap is getting full. This support has been added to
      __try_to_reclaim_swap().

    - try_to_free_swap() (called by __try_to_reclaim_swap()) checks
      pm_suspended_storage(), while free_swap_and_cache() does not. I think
      this is OK, because the page and the swap entry can be reclaimed
      later eventually.

    Link: http://lkml.kernel.org/r/20180827075535.17406-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

23 Aug, 2018

8 commits

    In this patch, locking-related code is shared between the huge/normal
    code paths in put_swap_page() to reduce code duplication. The
    `free_entries == 0` case is merged into the more general
    `free_entries != SWAPFILE_CLUSTER` case, because the new locking method
    makes it easy.

    The number of added lines is the same as the number of removed lines,
    but the code size is increased when CONFIG_TRANSPARENT_HUGEPAGE=n:

             text    data    bss     dec     hex     filename
    base:    24123   2004    340     26467   6763    mm/swapfile.o
    unified: 24485   2004    340     26829   68cd    mm/swapfile.o

    Digging one step deeper with `size -A mm/swapfile.o` for the base and
    unified kernels and comparing the results yields:

    -.text              17723   0
    +.text              17835   0
    -.orc_unwind_ip      1380   0
    +.orc_unwind_ip      1480   0
    -.orc_unwind         2070   0
    +.orc_unwind         2220   0
    -Total              26686
    +Total              27048

    The total difference is the same. The text segment difference is much
    smaller: 112. More of the difference comes from the ORC unwinder
    sections: (1480 + 2220) - (1380 + 2070) = 250. If the frame pointer
    unwinder is used, this costs nothing.

    Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    The part of __swap_entry_free() that runs with the lock held is
    separated into a new function, __swap_entry_free_locked(), because we
    want to reuse that piece of code in some other places.

    This is just mechanical code refactoring; there is no functional change
    in this function.

    Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    As suggested by Matthew Wilcox, it is better to use "int entry_size"
    instead of "bool cluster" as the parameter to specify whether to
    operate on huge or normal swap entries, because this improves the
    flexibility to support other swap entry sizes, and Dave Hansen thinks
    that this improves code readability too.

    So in this patch, the "bool cluster" parameter of get_swap_pages() is
    replaced by "int entry_size".

    And the nr_swap_entries() trick is used to reduce the binary size when
    !CONFIG_TRANSPARENT_HUGEPAGE:

          text    data    bss     dec     hex     filename
    base  24215   2028    340     26583   67d7    mm/swapfile.o
    head  24123   2004    340     26467   6763    mm/swapfile.o

    Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Matthew Wilcox
    Acked-by: Dave Hansen
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    In this patch, the normal/huge code paths in put_swap_page() and
    several helper functions are unified to avoid duplicated code, bugs,
    etc. and make the code easier to review.

    More lines are removed than added, and the binary size is kept exactly
    the same when CONFIG_TRANSPARENT_HUGEPAGE=n.

    Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Dave Hansen
    Acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    As suggested by Dave, we should unify the code paths for normal and
    huge swap support if possible, to avoid duplicated code, bugs, etc. and
    make it easier to review the code.

    In this patch, the normal/huge code path in
    swap_page_trans_huge_swapped() is unified; the numbers of added and
    removed lines are the same. And the binary size is kept almost the same
    when CONFIG_TRANSPARENT_HUGEPAGE=n:

              text    data    bss     dec     hex     filename
    base:     24179   2028    340     26547   67b3    mm/swapfile.o
    unified:  24215   2028    340     26583   67d7    mm/swapfile.o

    Link: http://lkml.kernel.org/r/20180720071845.17920-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-and-acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    In swap_page_trans_huge_swapped(), to identify whether there's any page
    table mapping for a 4k sized swap entry, "si->swap_map[i] !=
    SWAP_HAS_CACHE" is used. This works correctly now, because all users of
    the function only call it after checking SWAP_HAS_CACHE. But as pointed
    out by Daniel, it is better to use "swap_count(map[i])" here, because
    it works for the "map[i] == 0" case too.

    This also makes the implementation more consistent between normal and
    huge swap entries.

    Link: http://lkml.kernel.org/r/20180720071845.17920-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-and-reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
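
    A small standalone illustration of the difference being discussed:
    SWAP_HAS_CACHE is a flag bit in the swap_map byte, so
    "map[i] != SWAP_HAS_CACHE" and "swap_count(map[i]) != 0" disagree
    exactly for the map[i] == 0 case. The constants mirror
    include/linux/swap.h; the rest is just a demo harness.

    #include <stdio.h>

    #define SWAP_HAS_CACHE  0x40            /* flag: page is in the swap cache */

    static unsigned char swap_count(unsigned char ent)
    {
            return ent & ~SWAP_HAS_CACHE;   /* strip the cache flag, keep the count */
    }

    int main(void)
    {
            unsigned char samples[] = { 0x00, SWAP_HAS_CACHE, SWAP_HAS_CACHE | 1, 0x02 };

            for (unsigned int i = 0; i < sizeof(samples); i++)
                    printf("map=0x%02x  !=SWAP_HAS_CACHE:%d  swap_count()!=0:%d\n",
                           samples[i], samples[i] != SWAP_HAS_CACHE,
                           swap_count(samples[i]) != 0);
            return 0;
    }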
     
    In mm/swapfile.c, THP (Transparent Huge Page) swap specific code is
    enclosed by #ifdef CONFIG_THP_SWAP/#endif to avoid bloating the code
    when THP isn't enabled. But #ifdef/#endif in a .c file hurts code
    readability, so Dave suggested using IS_ENABLED(CONFIG_THP_SWAP)
    instead and letting the compiler do the dirty job for us. This also has
    the potential to remove some duplicated code. From the output of `size`:

                 text    data    bss     dec     hex     filename
    THP=y:       26269   2076    340     28685   700d    mm/swapfile.o
    ifdef/endif: 24115   2028    340     26483   6773    mm/swapfile.o
    IS_ENABLED:  24179   2028    340     26547   67b3    mm/swapfile.o

    The IS_ENABLED()-based solution works quite well, almost as well as
    #ifdef/#endif. And from the diffstat, more lines are removed than
    added.

    One #ifdef for split_swap_cluster() is kept. Because it is a public
    function with a stub implementation for CONFIG_THP_SWAP=n in swap.h.

    Link: http://lkml.kernel.org/r/20180720071845.17920-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-and-acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
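
    A simplified userspace rendition of the IS_ENABLED() trick, so the
    #ifdef-vs-IS_ENABLED trade-off can be seen in one compilable file. The
    macros follow the kernel's kconfig.h approach but are trimmed for the
    demo (no module handling).

    #include <stdio.h>

    #define CONFIG_THP_SWAP 1               /* comment out to "disable" the option */

    #define __ARG_PLACEHOLDER_1 0,
    #define __take_second_arg(__ignored, val, ...) val
    #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
    #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
    #define __is_defined(x) ___is_defined(x)
    #define IS_ENABLED(option) __is_defined(option)

    static int cluster_entries(void)
    {
            /*
             * Instead of wrapping whole functions in #ifdef CONFIG_THP_SWAP,
             * branch on IS_ENABLED(): the condition is a compile-time
             * constant, so the dead branch is optimized away, but both
             * branches stay visible to the compiler and to readers.
             */
            if (IS_ENABLED(CONFIG_THP_SWAP))
                    return 512;             /* cluster-sized (huge) entries */
            return 1;                       /* normal entries only */
    }

    int main(void)
    {
            printf("IS_ENABLED(CONFIG_THP_SWAP) = %d, entries = %d\n",
                   IS_ENABLED(CONFIG_THP_SWAP), cluster_entries());
            return 0;
    }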
     
  • Patch series "swap: THP optimizing refactoring", v4.

    Now the THP (Transparent Huge Page) swap optimizing is implemented in the
    way like below,

    #ifdef CONFIG_THP_SWAP
    huge_function(...)
    {
    }
    #else
    normal_function(...)
    {
    }
    #endif

    general_function(...)
    {
    if (huge)
    return thp_function(...);
    else
    return normal_function(...);
    }

    As pointed out by Dave Hansen, this will,

    1. Create a new, wholly untested code path for huge page
    2. Create two places to patch bugs
    3. Are not reusing code when possible

    This patchset is to address these problems via merging huge/normal code
    path/functions if possible.

    One concern is that this may cause code size to dilate when
    !CONFIG_TRANSPARENT_HUGEPAGE. The data shows that most refactoring will
    only cause quite slight code size increase.

    This patch (of 8):

    To improve code readability.

    Link: http://lkml.kernel.org/r/20180720071845.17920-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-and-acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
      Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

09 Jul, 2018

1 commit

  • Memory allocations can induce swapping via kswapd or direct reclaim. If
    we are having IO done for us by kswapd and don't actually go into direct
    reclaim we may never get scheduled for throttling. So instead check to
    see if our cgroup is congested, and if so schedule the throttling.
    Before we return to user space the throttling stuff will only throttle
    if we actually required it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Jun, 2018

1 commit

    For the L1TF workaround it's necessary to limit the swap file size to
    below MAX_PA/2, so that the higher bits of the swap offset, when
    inverted, never point to valid memory.

    Add a mechanism for the architecture to override the swap file size check
    in swapfile.c and add a x86 specific max swapfile check function that
    enforces that limit.

    The check is only enabled if the CPU is vulnerable to L1TF.

    In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
    with 46bit PA it is 32TB. The limit is only per individual swap file, so
    it's always possible to exceed these limits with multiple swap files or
    partitions.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen

    Andi Kleen
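
    A back-of-the-envelope check of the limits quoted above; the cap is
    MAX_PA/2, i.e. 2^(pa_bits - 1) bytes per swap file. This is just the
    arithmetic, not the kernel's check.

    #include <stdio.h>

    int main(void)
    {
            int pa_bits[] = { 42, 46 };     /* typical VM vs. native x86 from the text */

            for (int i = 0; i < 2; i++) {
                    unsigned long long limit = 1ULL << (pa_bits[i] - 1);    /* MAX_PA/2 */

                    printf("MAX_PA = 2^%d -> per-swapfile limit %llu TB\n",
                           pa_bits[i], limit >> 40);
            }
            return 0;
    }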
     

15 Jun, 2018

1 commit

  • Commit 570a335b8e22 ("swap_info: swap count continuations") introduces
    COUNT_CONTINUED but refers to it incorrectly as SWAP_HAS_CONT in a
    comment in swap_count. Fix it.

    Link: http://lkml.kernel.org/r/20180612175919.30413-1-daniel.m.jordan@oracle.com
    Fixes: 570a335b8e22 ("swap_info: swap count continuations")
    Signed-off-by: Daniel Jordan
    Reviewed-by: Andrew Morton
    Cc: "Huang, Ying"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     

13 Jun, 2018

1 commit

    The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
    patch replaces cases of:

    kvzalloc(a * b, gfp)

    with:
    kvcalloc(a, b, gfp)

    as well as handling cases of:

    kvzalloc(a * b * c, gfp)

    with:

    kvzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kvcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kvzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kvzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kvzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kvzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kvzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kvzalloc
    + kvcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kvzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kvzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kvzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kvzalloc(sizeof(THING) * C2, ...)
    |
    kvzalloc(sizeof(TYPE) * C2, ...)
    |
    kvzalloc(C1 * C2 * C3, ...)
    |
    kvzalloc(C1 * C2, ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kvzalloc
    + kvcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
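
    A userspace sketch of why the 2-factor form matters: an open-coded
    "a * b" silently wraps on overflow, while a calloc-style wrapper can
    refuse. This is a hand-rolled demo; the kernel's kvcalloc()/array_size()
    differ in detail (array_size() saturates to SIZE_MAX, for example).

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *demo_kvcalloc(size_t n, size_t size)
    {
            size_t bytes;

            if (__builtin_mul_overflow(n, size, &bytes))
                    return NULL;            /* refuse instead of under-allocating */
            return calloc(1, bytes);
    }

    int main(void)
    {
            size_t huge = SIZE_MAX / 2 + 1;
            int *ok = demo_kvcalloc(1024, sizeof(int));

            /* the open-coded "n * size" wraps around; the wrapper refuses */
            printf("open-coded byte count: %zu (wrapped)\n", huge * 4);
            printf("demo_kvcalloc(huge, 4) -> %p\n", demo_kvcalloc(huge, 4));
            free(ok);
            return 0;
    }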
     

26 May, 2018

1 commit

  • If swapon() fails after incrementing nr_rotate_swap, we don't decrement
    it and thus effectively leak it. Make sure we decrement it if we
    incremented it.

    Link: http://lkml.kernel.org/r/b6fe6b879f17fa68eee6cbd876f459f6e5e33495.1526491581.git.osandov@fb.com
    Fixes: 81a0298bdfab ("mm, swap: don't use VMA based swap readahead if HDD is used as swap")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Rik van Riel
    Reviewed-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     

12 Apr, 2018

2 commits

  • The pointer swap_avail_heads is local to the source and does not need to
    be in global scope, so make it static.

    Cleans up sparse warning:

    mm/swapfile.c:88:19: warning: symbol 'swap_avail_heads' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180206215836.12366-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • Calling swapon() on a zero length swap file on SSD can lead to a
    divide-by-zero.

    Although creating such files isn't possible with mkswap and they would
    be considered invalid, it would be better for the swapon code to be
    more robust and handle this condition gracefully (return -EINVAL),
    especially since the fix is small and straightforward.

    To help with wear leveling on SSD, the swapon syscall calculates a
    random position in the swap file using modulo p->highest_bit, which is
    set to maxpages - 1 in read_swap_header.

    If the swap file is zero length, read_swap_header sets maxpages=1 and
    last_page=0, resulting in p->highest_bit=0, and we divide by zero when
    we take the swap offset modulo p->highest_bit in the swapon syscall.

    This can be prevented by having read_swap_header return zero if
    last_page is zero.

    Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
    Signed-off-by: Thomas Abraham
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Abraham
     

12 Feb, 2018

1 commit

    This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\\<POLL$V\\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Nov, 2017

1 commit


16 Nov, 2017

2 commits

    When SWP_SYNCHRONOUS_IO swapped-in pages are shared by several
    processes, skipping the swap cache can cause unnecessary memory
    wastage, because with a swapin fault by read they could share a page if
    the page were in the swap cache, which avoids allocating new pages with
    the same content.

    This patch makes the swapcache skipping work only if the swap pte is
    non-sharable.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1507620825-5537-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Sergey Senozhatsky
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
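
    A minimal illustration of the resulting condition: bypass the swap
    cache only when the device is synchronous and the swap entry is not
    shared (simplified stand-in values and names, not the mm/memory.c code).

    #include <stdbool.h>
    #include <stdio.h>

    #define SWP_SYNCHRONOUS_IO      (1 << 0)        /* demo flag bit, not the kernel value */

    static bool skip_swapcache(unsigned int dev_flags, int swap_count)
    {
            return (dev_flags & SWP_SYNCHRONOUS_IO) && swap_count == 1;
    }

    int main(void)
    {
            printf("sync dev, private entry: %d\n", skip_swapcache(SWP_SYNCHRONOUS_IO, 1));
            printf("sync dev, shared entry : %d\n", skip_swapcache(SWP_SYNCHRONOUS_IO, 3));
            printf("rotational device      : %d\n", skip_swapcache(0, 1));
            return 0;
    }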
     
    With fast swap storage, platforms want to use swap more aggressively,
    and swap-in is crucial to application latency.

    The rw_page() based synchronous devices like zram, pmem and btt are
    such fast storage. When I profiled swapin performance with a zram lz4
    decompress test, the software overhead was more than 70%. Maybe it
    would be bigger on nvdimm.

    This patch aims to reduce swap-in latency by skipping the swapcache if
    the swap device is a synchronous device like an rw_page based device.
    It improves my swapin test (5G sequential swapin, no readahead) by
    about 45%, from 2.41sec to 1.64sec.

    Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Sergey Senozhatsky
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim