20 Apr, 2022

1 commit

  • commit e914d8f00391520ecc4495dd0ca0124538ab7119 upstream.

    When two processes share memory via CLONE_VM, a user process can be
    corrupted by unexpectedly seeing a zero-filled page.

    CPU A                               CPU B

    do_swap_page                        do_swap_page
    SWP_SYNCHRONOUS_IO path             SWP_SYNCHRONOUS_IO path
    swap_readpage  valid data
      swap_slot_free_notify
        delete zram entry
                                        swap_readpage  zeroed(invalid) data
                                        pte_lock
                                        map the *zero data* to userspace
                                        pte_unlock
    pte_lock
    if (!pte_same)
      goto out_nomap;
    pte_unlock
    return and next refault will
    read zeroed data

    The swap_slot_free_notify path is bogus for the CLONE_VM case since
    copy_mm does not increase the refcount of the swap slot, so it cannot
    tell whether it is safe to discard the data from the backing device.
    In this case, the only lock it could rely on to synchronize swap slot
    freeing is the page table lock. Thus, this patch gets rid of the
    swap_slot_free_notify function. With this patch, CPU A will see
    correct data.

    CPU A                               CPU B

    do_swap_page                        do_swap_page
    SWP_SYNCHRONOUS_IO path             SWP_SYNCHRONOUS_IO path
                                        swap_readpage  original data
                                        pte_lock
                                        map the original data
                                        swap_free
                                          swap_range_free
                                            bd_disk->fops->swap_slot_free_notify
    swap_readpage  read zeroed data
                                        pte_unlock
    pte_lock
    if (!pte_same)
      goto out_nomap;
    pte_unlock
    return
    on next refault will see mapped data by CPU B

    The concern with this patch is increased memory consumption, since it
    may keep wasted memory in compressed form in zram as well as in
    uncompressed form in the address space. However, most zram setups use
    no readahead, and do_swap_page is followed by swap_free, so the
    compressed copy in zram is freed quickly.
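
    A minimal sketch of the fixed SWP_SYNCHRONOUS_IO read path (an
    illustration, not the literal patch; swap_page_sector() is assumed to
    be the existing helper in mm/page_io.c, and error handling is
    omitted): the read no longer notifies the backing device, so the zram
    entry stays valid until swap_free() runs under the page table lock.

    static int swap_readpage_sync_sketch(struct page *page,
                                         struct swap_info_struct *sis)
    {
            /* read the slot contents synchronously via ->rw_page */
            int ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);

            /*
             * No swap_slot_free_notify() here: another CLONE_VM sharer may
             * still fault on the same slot, so the backing entry must stay
             * valid until swap_free() drops the last reference under the
             * page table lock.
             */
            return ret;
    }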

    Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
    Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
    Reported-by: Ivan Babrou
    Tested-by: Ivan Babrou
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jens Axboe
    Cc: David Hildenbrand
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

03 Mar, 2021

1 commit

  • We're not factoring in the start of the file for where to write and
    read the swapfile, which leads to very unfortunate side effects of
    writing where we should not be...
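
    A hedged sketch of the idea behind the fix: the device sector must be
    derived from the swap extents (which record where the swapfile
    actually lives on the device), not from the swap offset alone.
    offset_to_swap_extent() is assumed here to be the extent lookup
    helper in mm/swapfile.c; the real change may differ in detail.

    static sector_t swap_page_sector_sketch(struct page *page)
    {
            struct swap_info_struct *sis = page_swap_info(page);
            struct swap_extent *se;
            pgoff_t offset = __page_file_index(page);

            /* translate the swap offset through the extent covering it, so
             * the swapfile's start on the block device is accounted for */
            se = offset_to_swap_extent(sis, offset);
            return (se->start_block + (offset - se->start_page))
                        << (PAGE_SHIFT - 9);
    }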

    Fixes: 48d15436fde6 ("mm: remove get_swap_bio")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Feb, 2021

1 commit

  • If there are errors during swap read or write, they can easily fill the
    log buffer and remove any previous messages that might be useful for
    debugging, especially on systems that rely for logging only on the kernel
    ring-buffer.

    For example, on systems using zram as swap, we are more likely to see
    any page allocation errors preceding the swap write errors if the
    alerts are ratelimited.
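
    A minimal sketch of the change, assuming the alerts in mm/page_io.c
    simply switch to the ratelimited printk variants (message text
    approximated):

    /* before: every failed swap write emits an unthrottled alert */
    pr_alert("Write-error on swap-device (%u:%u:%llu)\n",
             MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)),
             (unsigned long long)bio->bi_iter.bi_sector);

    /* after: rate-limit so a burst of IO errors cannot flush the ring buffer */
    pr_alert_ratelimited("Write-error on swap-device (%u:%u:%llu)\n",
                         MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)),
                         (unsigned long long)bio->bi_iter.bi_sector);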

    Link: https://lkml.kernel.org/r/20210201142055.29068-1-georgi.djakov@linaro.org
    Signed-off-by: Georgi Djakov
    Acked-by: Minchan Kim
    Reviewed-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Georgi Djakov
     

28 Jan, 2021

1 commit

    Just reuse the block_device and sector from the swap_info structure,
    as the SWP_SYNCHRONOUS_IO path already does. Also remove the checks
    for NULL returns from bio_alloc, as that can't happen for sleeping
    allocations.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Chaitanya Kulkarni
    Acked-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

25 Jan, 2021

1 commit

    Replace the gendisk pointer in struct bio with a pointer to the newly
    improved struct block_device. From that the gendisk can be trivially
    accessed with an extra indirection, and it also allows directly
    looking up all information related to partition remapping.
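
    After this change, access looks roughly like the sketch below (a
    conceptual illustration; bd_start_sect is assumed to be the
    partition-start field on struct block_device):

    /* before: the bio carried a gendisk pointer (bio->bi_disk) plus a
     * partition number that had to be remapped separately */

    /* after: the bio points at the block_device; the gendisk is one extra
     * indirection away, and partition data is directly reachable */
    struct gendisk *disk = bio->bi_bdev->bd_disk;
    sector_t part_start = bio->bi_bdev->bd_start_sect;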

    Signed-off-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

03 Dec, 2020

1 commit

  • Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

    Currently a non-slab kernel page which has been charged to a memory cgroup
    can't be mapped to userspace. The underlying reason is simple: PageKmemcg
    flag is defined as a page type (like buddy, offline, etc), so it takes a
    bit from a page->mapped counter. Pages with a type set can't be mapped to
    userspace.

    But in general the kmemcg flag has nothing to do with mapping to
    userspace. It only means that the page has been accounted by the page
    allocator, so it has to be properly uncharged on release.

    Some bpf maps are mapping the vmalloc-based memory to userspace, and their
    memory can't be accounted because of this implementation detail.

    This patchset removes this limitation by moving the PageKmemcg flag into
    one of the free bits of the page->mem_cgroup pointer. Also it formalizes
    accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
    adds several checks and removes a couple of obsolete functions. As the
    result the code became more robust with fewer open-coded bit tricks.

    This patch (of 4):

    Currently there are many open-coded reads of the page->mem_cgroup pointer,
    as well as a couple of read helpers, which are barely used.

    It creates an obstacle on the way to reusing some bits of the pointer
    for storing additional bits of information. In fact, we already do
    this for slab pages, where the last bit indicates that the pointer has
    an attached vector of objcg pointers instead of a regular memcg
    pointer.

    This commit uses two existing helpers and introduces a new one,
    converting all read sides to calls of these helpers:
    struct mem_cgroup *page_memcg(struct page *page);
    struct mem_cgroup *page_memcg_rcu(struct page *page);
    struct mem_cgroup *page_memcg_check(struct page *page);

    page_memcg_check() is intended to be used in cases when the page can
    be a slab page and have a memcg pointer pointing at an objcg vector.
    It checks the lowest bit, and if set, returns NULL. page_memcg()
    contains a VM_BUG_ON_PAGE() check for the page not being a slab page.

    To make sure nobody uses a direct access, struct page's
    mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
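
    A rough sketch of the helpers' shape (simplified; MEMCG_DATA_FLAGS_MASK
    is a placeholder for the real flag bit, and the in-tree debug checks
    may differ):

    #define MEMCG_DATA_FLAGS_MASK  0x1UL   /* low bit: objcg vector attached */

    static inline struct mem_cgroup *page_memcg_sketch(struct page *page)
    {
            /* caller guarantees this is not a slab page */
            VM_BUG_ON_PAGE(PageSlab(page), page);
            return (struct mem_cgroup *)page->memcg_data;
    }

    static inline struct mem_cgroup *page_memcg_check_sketch(struct page *page)
    {
            unsigned long memcg_data = READ_ONCE(page->memcg_data);

            /* slab pages store an objcg vector instead of a memcg pointer */
            if (memcg_data & MEMCG_DATA_FLAGS_MASK)
                    return NULL;
            return (struct mem_cgroup *)memcg_data;
    }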

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
    Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
    Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com

    Roman Gushchin
     

14 Oct, 2020

3 commits

    The out label is used in only one place and simply returns ret, with
    no resource cleanup, lock release or the like required. So remove this
    jump label and return directly.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200927124032.22521-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • SWP_FS is used to make swap_{read,write}page() go through the filesystem,
    and it's only used for swap files over NFS for now. Otherwise it will
    directly submit IO to blockdev according to swapfile extents reported by
    filesystems in advance.

    As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's
    rename to SWP_FS_OPS.

    [1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org

    Suggested-by: Matthew Wilcox
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

04 Sep, 2020

1 commit

    Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
    every physical page. When swapping pages out to disk it is necessary
    to save these tags, and later restore them when reading the pages back.

    Add some hooks along with dummy implementations to enable the
    arch code to handle this.

    Three new hooks are added to the swap code:
      * arch_prepare_to_swap()
      * arch_swap_invalidate_page() / arch_swap_invalidate_area()
    One new hook is added to shmem:
      * arch_swap_restore()
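
    The dummy implementations are essentially no-ops that architectures
    can override; a sketch (signatures inferred from the hook names above,
    following the usual __HAVE_ARCH_* override convention):

    #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
    static inline int arch_prepare_to_swap(struct page *page)
    {
            return 0;               /* nothing to save by default */
    }
    #endif

    #ifndef __HAVE_ARCH_SWAP_INVALIDATE
    static inline void arch_swap_invalidate_page(int type, pgoff_t offset) { }
    static inline void arch_swap_invalidate_area(int type) { }
    #endif

    #ifndef __HAVE_ARCH_SWAP_RESTORE
    static inline void arch_swap_restore(swp_entry_t entry, struct page *page) { }
    #endif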

    Signed-off-by: Steven Price
    [catalin.marinas@arm.com: add unlock_page() on the error path]
    [catalin.marinas@arm.com: dropped the _tags suffix]
    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Steven Price
     

15 Aug, 2020

3 commits

  • struct swap_info_struct si.flags could be accessed concurrently as noticed
    by KCSAN,

    BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage

    write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1740/0x2820
    shrink_inactive_list+0x316/0x8b0
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
    swap_readpage+0x204/0x6a0
    swap_readpage at mm/page_io.c:380
    read_swap_cache_async+0xa2/0xb0
    swapin_readahead+0x6a0/0x890
    do_swap_page+0x465/0xeb0
    __handle_mm_fault+0xc7a/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 7 PID: 5422 Comm: gmain Tainted: G W O L 5.5.0-next-20200204+ #6
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    Other reads,

    read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
    __swap_writepage+0x140/0xc20
    __swap_writepage at mm/page_io.c:289

    read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
    swap_set_page_dirty+0x44/0x1f4
    swap_set_page_dirty at mm/page_io.c:442

    The write is under &si->lock, but the reads are done locklessly.
    Since the reads only check for a specific bit in the flag, it is
    harmless even if load tearing happens. Thus, just mark them as
    intentional data races using the data_race() macro.
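
    For example, the lockless flag tests can be wrapped roughly like this
    (a sketch; data_race() only documents that the racy load is
    intentional, it does not add any ordering):

    /* reader side, e.g. swap_readpage(): si->lock is not held here */
    if (data_race(sis->flags & SWP_SYNCHRONOUS_IO))
            synchronous = true;     /* a single bit, load tearing is harmless */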

    [cai@lca.pw: add a missing annotation]
    Link: http://lkml.kernel.org/r/1581612585-5812-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200207003601.1526-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This function returns the number of bytes in a THP. It is like
    page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
    is disabled.
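
    A sketch of its shape (thp_order() collapses to 0 when
    CONFIG_TRANSPARENT_HUGEPAGE is disabled, so this compiles down to
    PAGE_SIZE):

    static inline unsigned long thp_size(struct page *page)
    {
            /* number of bytes spanned by this (possibly huge) page */
            return PAGE_SIZE << thp_order(page);
    }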

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

08 Aug, 2020

1 commit

    swap_readpage() does synchronous IO for one page. The IO is not big
    and normally finishes quickly, but it may take a long time or wait
    forever in case of an IO failure or discard.

    This patch uses blk_io_schedule() instead of io_schedule() to avoid a
    task hang and crash (when /proc/sys/kernel/hung_task_panic is set)
    when the above exception occurs.

    This is similar to the hung task avoidance in submit_bio_wait(),
    blk_execute_rq() and __blkdev_direct_IO().
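
    A sketch of the affected wait loop in swap_readpage() (simplified;
    variable names follow the surrounding code): the only change is that
    the sleep uses blk_io_schedule(), which bounds the sleep so the
    hung-task detector is not tripped by a stuck device.

    for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!READ_ONCE(bio->bi_private))
                    break;                  /* completion already ran */
            if (!blk_poll(disk->queue, qc, true))
                    blk_io_schedule();      /* was: io_schedule() */
    }
    __set_current_state(TASK_RUNNING);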

    Signed-off-by: Xianting Tian
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Ming Lei
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1596461807-21087-1-git-send-email-xianting_tian@126.com
    Signed-off-by: Linus Torvalds

    Xianting Tian
     

10 Jun, 2020

1 commit

  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of
    the functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include
    <asm/pgtable.h> in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

03 Feb, 2020

1 commit

    Currently, bmap() returns either the physical block number related to
    the requested file offset, or 0 in case of error or when the requested
    offset maps into a hole.
    This patch makes the needed changes to enable bmap() to properly
    return errors, using the return value as an error code; a pointer must
    now be passed to bmap() to be filled with the mapped physical block.

    It will change the behavior of bmap() on return:

    - negative value in case of error
    - zero on success or map fell into a hole

    In case of a hole, the *block will be zero too

    Since this is a prep patch, for now the only error returned is -EINVAL
    if ->bmap doesn't exist.
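
    Callers now look roughly like this sketch of the new calling
    convention (probe_block stands in for the file offset being probed):

    sector_t block = probe_block;
    int err = bmap(inode, &block);

    if (err)
            return err;     /* negative error, e.g. -EINVAL: no ->bmap */
    if (block == 0)
            return -EINVAL; /* success, but the offset fell into a hole */
    /* block now holds the mapped physical block number */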

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Al Viro

    Carlos Maiolino
     

02 Dec, 2019

1 commit

    If a block device supports the rw_page operation, it doesn't submit
    bios, so the annotation in submit_bio() for refault stalls doesn't
    work. This happens with zram on Android, especially on the swap read
    path, which can consume CPU cycles for decompression. It is also a
    problem for zswap, which uses frontswap.

    Annotate swap_readpage() to account for the synchronous IO overhead,
    to prevent under-reporting memory pressure.
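
    The annotation is conceptually a psi_memstall_enter()/
    psi_memstall_leave() pair around the synchronous part of
    swap_readpage(); roughly (do_the_read() is a hypothetical stand-in
    for the rw_page/frontswap read):

    int swap_readpage(struct page *page, bool synchronous)
    {
            unsigned long pflags;
            int ret;

            /* count the time spent doing (possibly CPU-bound) sync swap-in
             * as memory pressure, since no bio-based annotation runs here */
            psi_memstall_enter(&pflags);

            ret = do_the_read(page, synchronous);

            psi_memstall_leave(&pflags);
            return ret;
    }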

    [akpm@linux-foundation.org: add comment, per Johannes]
    Link: http://lkml.kernel.org/r/20191010152134.38545-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Seth Jennings
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

16 Nov, 2019

1 commit

    The following race is observed, due to which a process faulting on a
    swap entry finds the page neither in the swapcache nor in swap. This
    causes zram to hand out a zero-filled page that gets mapped to the
    process, resulting in a user space crash later.

    Consider parent and child processes Pa and Pb sharing the same swap slot
    with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
    Virtual address 'VA' of Pa and Pb points to the shared swap entry.

    Pa                                  Pb

    fault on VA                         fault on VA
    do_swap_page                        do_swap_page
    lookup_swap_cache fails             lookup_swap_cache fails
                                        Pb scheduled out
    swapin_readahead (deletes zram entry)
    swap_free (makes swap_count 1)
                                        Pb scheduled in
                                        swap_readpage (swap_count == 1)
                                        Takes SWP_SYNCHRONOUS_IO path
                                        zram entry absent
                                        zram gives a zero filled page

    Fix this by making sure that the swap slot is freed only when the swap
    count drops down to one.
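
    A sketch of the idea (helper names approximate; the in-tree fix may
    differ in detail): only tell the backing device to drop the slot when
    no other reference to it remains.

    static void swap_slot_free_notify_sketch(struct page *page)
    {
            struct swap_info_struct *sis = page_swap_info(page);
            struct gendisk *disk = sis->bdev->bd_disk;
            swp_entry_t entry = { .val = page_private(page) };

            /* only safe to discard the backing data when this is the last
             * reference to the swap slot */
            if (disk->fops->swap_slot_free_notify &&
                __swap_count(entry) == 1)
                    disk->fops->swap_slot_free_notify(sis->bdev,
                                                      swp_offset(entry));
    }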

    Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
    Fixes: aa8d22a11da9 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
    Signed-off-by: Vinayak Menon
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     

13 Jul, 2019

1 commit

  • swap_extent is used to map swap page offset to backing device's block
    offset. For a continuous block range, one swap_extent is used and all
    these swap_extents are managed in a linked list.

    These swap_extents are used by map_swap_entry() during swap's read and
    write path. To find out the backing device's block offset for a page
    offset, the swap_extent list will be traversed linearly, with
    curr_swap_extent being used as a cache to speed up the search.

    This works well as long as swap_extents are not huge or when the number
    of processes that access swap device are few, but when the swap device
    has many extents and there are a number of processes accessing the swap
    device concurrently, it can be a problem. On one of our servers, the
    disk's remaining size is tight:

    $ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    ... ...
    /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4

    When creating an 80G swapfile there, there are as many as 84656 swap
    extents. The end result is that the kernel spends about 30% of its
    time in map_swap_entry() and swap throughput is only 70MB/s.

    As a comparison, when I used a smaller swapfile, like 4G, where the
    number of swap_extents dropped to 2000, swap throughput is back to
    400-500MB/s and map_swap_entry() is about 3%.

    One downside of using an rbtree for swap_extent is that 'struct
    rb_node' takes 24 bytes while 'struct list_head' takes 16 bytes, i.e.
    8 bytes more for each swap_extent. For a swapfile that has 80k
    swap_extents, that means 625KiB more memory consumed.
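
    A sketch of what the rbtree-keyed lookup looks like (close in spirit
    to the offset_to_swap_extent() introduced here; the swap_extent_root
    field name and details are assumptions):

    static struct swap_extent *se_lookup(struct swap_info_struct *sis,
                                         pgoff_t offset)
    {
            struct rb_node *rb = sis->swap_extent_root.rb_node;

            while (rb) {
                    struct swap_extent *se = rb_entry(rb, struct swap_extent,
                                                      rb_node);

                    if (offset < se->start_page)
                            rb = rb->rb_left;
                    else if (offset >= se->start_page + se->nr_pages)
                            rb = rb->rb_right;
                    else
                            return se;      /* extent covering this offset */
            }
            BUG();  /* a mapped swap offset must always have an extent */
    }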

    Test:

    Since it's not possible to reboot that server, I can not test this
    patch directly there. Instead, I tested it on another server with an
    NVMe disk.

    I created a 20G swapfile on an NVMe backed XFS fs. By default, the
    filesystem is quite clean and the created swapfile has only 2 extents.
    Testing vanilla and this patch shows no obvious performance difference
    when swapfile is not fragmented.

    To see the patch's effects, I used some tweaks to manually fragment the
    swapfile by breaking the extent at 1M boundary. This made the swapfile
    have 20K extents.

    nr_task=4
    kernel    swapout(KB/s)    map_swap_entry(perf)    swapin(KB/s)     map_swap_entry(perf)
    vanilla   165191           90.77%                  171798           90.21%
    patched   858993 +420%      2.16%                  715827 +317%      0.77%

    nr_task=8
    kernel    swapout(KB/s)    map_swap_entry(perf)    swapin(KB/s)     map_swap_entry(perf)
    vanilla   306783           92.19%                  318145           87.76%
    patched   954437 +211%      2.35%                 1073741 +237%      1.57%

    swapout: the throughput of swap out, in KB/s, higher is better
    1st map_swap_entry: cpu cycles percent sampled by perf
    swapin: the throughput of swap in, in KB/s, higher is better
    2nd map_swap_entry: cpu cycles percent sampled by perf

    nr_task=1 doesn't show any difference; this is because
    curr_swap_extent can effectively cache the correct swap extent for a
    single task workload.

    [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
    Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
    Signed-off-by: Aaron Lu
    Cc: Huang Ying
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

05 Jul, 2019

1 commit

    swap_readpage() sets waiter = bio->bi_private even if synchronous is
    false; this means that the caller can get a spurious wakeup after
    return.

    This can be fatal if blk_wake_io_task() does
    set_current_state(TASK_RUNNING) after the caller does
    set_special_state(), in the worst case the kernel can crash in
    do_task_dead().
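
    The fix is conceptually to hand the bio a waiter only when the caller
    really waits, so blk_wake_io_task() is never run against a task that
    already returned; a sketch (names follow the surrounding
    swap_readpage() code, details approximate):

    bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
    bio->bi_private = NULL;                 /* default: nobody is waiting */

    if (synchronous) {
            bio->bi_opf |= REQ_HIPRI;
            get_task_struct(current);
            bio->bi_private = current;      /* only now may completion wake us */
    }
    submit_bio(bio);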

    Link: http://lkml.kernel.org/r/20190704160301.GA5956@redhat.com
    Fixes: 0619317ff8baa2d ("block: add polled wakeup task helper")
    Signed-off-by: Oleg Nesterov
    Reported-by: Qian Cai
    Acked-by: Hugh Dickins
    Reviewed-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Jun, 2019

1 commit

  • 0-Day test system reported some OOM regressions for several THP
    (Transparent Huge Page) swap test cases. These regressions are bisected
    to 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"). In the
    commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the
    bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out
    THP. That causes the OOM.

    As in the patch description of 6861428921b5 ("block: always define
    BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP
    to swap space. So the issue is fixed via doing that in get_swap_bio().

    BTW: I remember I checked the THP swap code when 6861428921b5 ("block:
    always define BIO_MAX_PAGES as 256") was merged, and thought the THP
    swap code needn't be changed. But apparently, I was wrong. I should
    have done this at that time.

    Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
    Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Ming Lei
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Daniel Jordan
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

05 Jan, 2019

1 commit

  • swap_readpage() wants to do polling to bring in pages if asked to, but
    it doesn't mark the bio as being polled. Additionally, the looping
    around the blk_poll() check isn't correct - if we get a zero return, we
    should call io_schedule(), we can't just assume that the bio has
    completed. The regular bio->bi_private check should be used for that.
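
    A sketch of the corrected flow: mark the bio as a polled/high-priority
    one, and treat a zero return from blk_poll() as "nothing completed
    yet", falling back to io_schedule() instead of assuming completion
    (variable names follow the surrounding swap_readpage() code):

    bio->bi_opf |= REQ_HIPRI;               /* actually request polled completion */
    qc = submit_bio(bio);

    while (synchronous) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!READ_ONCE(bio->bi_private))
                    break;                  /* end_swap_bio_read() has run */

            if (!blk_poll(disk->queue, qc, true))
                    io_schedule();          /* zero return != completed */
    }
    __set_current_state(TASK_RUNNING);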

    Link: http://lkml.kernel.org/r/e15243a8-2cdf-c32c-ecee-f289377c8ef9@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Andrew Morton
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

03 Jan, 2019

1 commit

  • This mostly reverts commit 849a370016a5 ("block: avoid ordered task
    state change for polled IO"). It was wrongly claiming that the ordering
    wasn't necessary. The memory barrier _is_ necessary.

    If something is truly polling and not going to sleep, it's the whole
    state setting that is unnecessary, not the memory barrier. Whenever you
    set your state to a sleeping state, you absolutely need the memory
    barrier.

    Note that sometimes the memory barrier can be elsewhere. For example,
    the ordering might be provided by an external lock, or by setting the
    process state to sleeping before adding yourself to the wait queue list
    that is used for waking up (where the wait queue lock itself will
    guarantee that any wakeup will correctly see the sleeping state).

    But none of those cases were true here.

    NOTE! Some of the polling paths may indeed be able to drop the state
    setting entirely, at which point the memory barrier also goes away.

    (Also note that this doesn't revert the TASK_RUNNING cases: there is no
    race between a wakeup and setting the process state to TASK_RUNNING,
    since the end result doesn't depend on ordering).

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Dec, 2018

1 commit

    A prior patch in this series added blkg association to bios issued by
    cgroups. There are two other paths that we want to attribute work back
    to the appropriate cgroup: swap and writeback. Here we modify the way
    swap tags bios to include the blkg. Writeback will be tackled in the
    next patch.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

26 Nov, 2018

1 commit

  • blk_poll() has always kept spinning until it found an IO. This is
    fine for SYNC polling, since we need to find one request we have
    pending, but in preparation for ASYNC polling it can be beneficial
    to just check if we have any entries available or not.

    Existing callers are converted to pass in 'spin == true', to retain
    the old behavior.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Nov, 2018

1 commit

    For the core poll helper, the task state setting doesn't need to imply
    any atomics, as it's the current task itself that is being modified
    and we're not going to sleep.

    For IRQ driven IO, the wakeup path has the necessary barriers, so we
    don't need to use the heavy handed version of the task state setting.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Nov, 2018

1 commit

    If we're polling for IO on a device that doesn't use interrupts, then
    the IO completion loop (and the wake of the task) is done by the
    submitting task itself. If that is the case, then we don't need to
    enter the wake_up_process() function, we can simply mark ourselves as
    TASK_RUNNING.
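
    The resulting helper is roughly (per the description; the in-tree
    version may differ in detail):

    static inline void blk_wake_io_task(struct task_struct *waiter)
    {
            /*
             * If we're polling, the submitting task is the one completing
             * the IO; it is already running, so skip the wakeup machinery.
             */
            if (waiter == current)
                    __set_current_state(TASK_RUNNING);
            else
                    wake_up_process(waiter);
    }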

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

2 commits

  • Pull AFS updates from Al Viro:
    "AFS series, with some iov_iter bits included"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    missing bits of "iov_iter: Separate type from direction and use accessor functions"
    afs: Probe multiple fileservers simultaneously
    afs: Fix callback handling
    afs: Eliminate the address pointer from the address list cursor
    afs: Allow dumping of server cursor on operation failure
    afs: Implement YFS support in the fs client
    afs: Expand data structure fields to support YFS
    afs: Get the target vnode in afs_rmdir() and get a callback on it
    afs: Calc callback expiry in op reply delivery
    afs: Fix FS.FetchStatus delivery from updating wrong vnode
    afs: Implement the YFS cache manager service
    afs: Remove callback details from afs_callback_break struct
    afs: Commit the status on a new file/dir/symlink
    afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
    afs: Don't invoke the server to read data beyond EOF
    afs: Add a couple of tracepoints to log I/O errors
    afs: Handle EIO from delivery function
    afs: Fix TTL on VL server and address lists
    afs: Implement VL server rotation
    afs: Improve FS server rotation error handling
    ...

    Linus Torvalds
     
    This reverts a series committed earlier due to a null pointer
    exception bug report in [1]. It seems there are edge case interactions
    that I did not consider, and it will need some time to understand what
    causes the adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

27 Oct, 2018

1 commit

  • The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
    through the filesystem, and to make swapoff() call ->swap_deactivate().
    For Btrfs, we want the latter but not the former, so split this flag into
    two. This makes us always call ->swap_deactivate() if ->swap_activate()
    succeeded, not just if it didn't add any swap extents itself.

    This also resolves the issue of the very misleading name of SWP_FILE,
    which is only used for swap files over NFS.

    Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: David Sterba
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     

24 Oct, 2018

1 commit

  • In the iov_iter struct, separate the iterator type from the iterator
    direction and use accessor functions to access them in most places.

    Convert a bunch of places to use switch statements to access them
    rather than chains of bitwise-AND statements. This makes it easier to
    add further iterator types. It can also be more efficient, since for a
    switch over small contiguous integers the compiler can use roughly 50%
    fewer compare instructions than it needs for bitwise-AND tests.

    Further, cease passing the iterator type into the iterator setup function.
    The iterator function can set that itself. Only the direction is required.

    Signed-off-by: David Howells

    David Howells
     

22 Sep, 2018

1 commit

    A prior patch in this series added blkg association to bios issued by
    cgroups. There are two other paths that we want to attribute work back
    to the appropriate cgroup: swap and writeback. Here we modify the way
    swap tags bios to include the blkg. Writeback will be tackled in the
    next patch.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

09 Jul, 2018

2 commits

  • For backcharging we need to know who the page belongs to when swapping
    it out. We don't worry about things that do ->rw_page (zram etc) at the
    moment, we're only worried about pages that actually go to a block
    device.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Just like REQ_META, it's important to know the IO coming down is swap
    in order to guard against potential IO priority inversion issues with
    cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
    bio_issue_as_root_blkg helper.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

16 Nov, 2017

1 commit

  • With fast swap storage, the platforms want to use swap more aggressively
    and swap-in is crucial to application latency.

    The rw_page() based synchronous devices like zram, pmem and btt are
    such fast storage. When I profile swap-in performance with a zram lz4
    decompression test, the software overhead is more than 70%. It might
    be even bigger on nvdimm.

    This patch aims to reduce swap-in latency by skipping the swapcache if
    the swap device is a synchronous, rw_page based device. It improves my
    swap-in test (5G sequential swapin, no readahead) by 45%, from 2.41sec
    to 1.64sec.
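
    The gist in do_swap_page(), sketched below (helper names and
    signatures are approximate, not the literal patch): read the page
    directly and skip the swapcache when the device is synchronous and we
    hold the only reference to the slot.

    struct swap_info_struct *si = swp_swap_info(entry);

    if ((si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) {
            /* skip the swapcache: allocate a private page and read the
             * slot synchronously (zram/pmem/btt complete immediately) */
            page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
            if (page) {
                    __SetPageLocked(page);
                    __SetPageSwapBacked(page);
                    set_page_private(page, entry.val);
                    swap_readpage(page, true);
            }
    } else {
            /* shared slot or slow device: go through the swapcache */
            page = lookup_swap_cache(entry, vma, vmf->address);
    }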

    Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Sergey Senozhatsky
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds