14 Oct, 2020

3 commits

  • The out label is used in only one place and simply returns ret, with no
    resource cleanup or lock release. So we should remove this jump label,
    return directly, and do some cleanup.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200927124032.22521-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • SWP_FS is used to make swap_{read,write}page() go through the filesystem,
    and it's only used for swap files over NFS for now. Otherwise it will
    directly submit IO to blockdev according to swapfile extents reported by
    filesystems in advance.

    As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's
    rename to SWP_FS_OPS.

    [1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org

    Suggested-by: Matthew Wilcox
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

25 Sep, 2020

1 commit


04 Sep, 2020

1 commit

  • Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
    every physical page. When swapping pages out to disk it is necessary to
    save these tags, and later restore them when reading the pages back.

    Add some hooks along with dummy implementations to enable the
    arch code to handle this.

    Three new hooks are added to the swap code:
    * arch_prepare_to_swap() and
    * arch_swap_invalidate_page() / arch_swap_invalidate_area().
    One new hook is added to shmem:
    * arch_swap_restore()
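
    The hook-plus-dummy pattern described above can be sketched in plain C
    as follows. This is a user-space illustration only: struct page_model,
    the HAVE_ARCH_SWAP_HOOKS_MODEL guard and swap_out_model() are
    hypothetical names, and the parameter types are simplified stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct page_model { int unused; };

/*
 * An architecture that needs the hooks would define the guard macro and
 * supply its own versions; otherwise these no-op fallbacks are used.
 * The guard name is an assumption made for this sketch.
 */
#ifndef HAVE_ARCH_SWAP_HOOKS_MODEL
static inline int arch_prepare_to_swap(struct page_model *page)
{
    (void)page;
    return 0;   /* nothing to save, never fails */
}

static inline void arch_swap_invalidate_page(int type, unsigned long offset)
{
    (void)type;
    (void)offset;
}

static inline void arch_swap_invalidate_area(int type)
{
    (void)type;
}

static inline void arch_swap_restore(unsigned long entry, struct page_model *page)
{
    (void)entry;
    (void)page;
}
#endif

/* Model of a swap-out path: the arch hook runs first and can veto the write. */
static int swap_out_model(struct page_model *page)
{
    int ret = arch_prepare_to_swap(page);

    if (ret)
        return ret;     /* a real caller would unlock the page and bail out */
    return 0;           /* proceed with the write */
}
```

    With the dummy implementations the hooks compile away, so generic code
    can call them unconditionally.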

    Signed-off-by: Steven Price
    [catalin.marinas@arm.com: add unlock_page() on the error path]
    [catalin.marinas@arm.com: dropped the _tags suffix]
    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Steven Price
     

15 Aug, 2020

3 commits

  • struct swap_info_struct si.flags could be accessed concurrently as noticed
    by KCSAN,

    BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage

    write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1740/0x2820
    shrink_inactive_list+0x316/0x8b0
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
    swap_readpage+0x204/0x6a0
    swap_readpage at mm/page_io.c:380
    read_swap_cache_async+0xa2/0xb0
    swapin_readahead+0x6a0/0x890
    do_swap_page+0x465/0xeb0
    __handle_mm_fault+0xc7a/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 7 PID: 5422 Comm: gmain Tainted: G W O L 5.5.0-next-20200204+ #6
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    Other reads,

    read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
    __swap_writepage+0x140/0xc20
    __swap_writepage at mm/page_io.c:289

    read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
    swap_set_page_dirty+0x44/0x1f4
    swap_set_page_dirty at mm/page_io.c:442

    The write is done under &si->lock, but the reads are lockless. Since
    the reads only check for a specific bit in the flag, it is harmless even
    if load tearing happens. Thus, just mark them as intentional data races
    using the data_race() macro.
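
    A minimal user-space sketch of such an annotated lockless bit test is
    shown here. The data_race() definition is a stand-in that only
    evaluates its argument, and SWP_FLAG_MODEL and struct swap_info_model
    are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel macro: here it only evaluates the expression. */
#define data_race(expr) (expr)

#define SWP_FLAG_MODEL (1UL << 3)   /* hypothetical flag bit */

struct swap_info_model {
    unsigned long flags;    /* written under a lock, read locklessly */
};

/* Lockless test of a single bit, marked as an intentional data race. */
static bool swap_flag_set(const struct swap_info_model *si)
{
    return data_race(si->flags & SWP_FLAG_MODEL) != 0;
}
```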

    [cai@lca.pw: add a missing annotation]
    Link: http://lkml.kernel.org/r/1581612585-5812-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200207003601.1526-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This function returns the number of bytes in a THP. It is like
    page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
    is disabled.
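
    The compile-time behavior can be sketched as follows. This is an
    assumption-laden model rather than the kernel helper: PAGE_SIZE_MODEL,
    CONFIG_THP_MODEL, struct page_model and thp_size_model() are all
    hypothetical names.

```c
#include <assert.h>

#define PAGE_SIZE_MODEL 4096UL  /* assumed base page size */

/* Hypothetical stand-in for struct page, carrying only the order. */
struct page_model { unsigned int order; };

#ifdef CONFIG_THP_MODEL
/* With THP enabled, the size depends on the order of the page. */
static inline unsigned long thp_size_model(const struct page_model *page)
{
    return PAGE_SIZE_MODEL << page->order;
}
#else
/* With THP compiled out, the result is a compile-time constant. */
static inline unsigned long thp_size_model(const struct page_model *page)
{
    (void)page;
    return PAGE_SIZE_MODEL;
}
#endif
```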

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

08 Aug, 2020

1 commit

  • swap_readpage() does the sync io for one page. The io is small and
    normally finishes quickly, but it may take a long time or wait forever
    in case of io failure or discard.

    This patch uses blk_io_schedule() instead of io_schedule() to avoid a
    hung task report and crash (when /proc/sys/kernel/hung_task_panic is
    set) when the above exception occurs.

    This is similar to the hung task avoidance in submit_bio_wait(),
    blk_execute_rq() and __blkdev_direct_IO().
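
    The idea behind this kind of hung task avoidance can be sketched as
    sleeping in bounded slices instead of one unbounded sleep. The model
    below assumes each slice is half of the hung task timeout;
    io_wait_slice() and slices_needed() are hypothetical helpers, not
    kernel functions.

```c
#include <assert.h>

/*
 * Length of one sleep slice in ticks, or 0 to mean "sleep without a
 * timeout" (hung task detection disabled). Half of the hung task
 * timeout keeps each sleep shorter than the detector's threshold.
 */
static unsigned long io_wait_slice(unsigned long hung_timeout_secs,
                                   unsigned long hz)
{
    return hung_timeout_secs * hz / 2;
}

/* Number of sleeps needed to cover a wait of total_ticks. */
static unsigned long slices_needed(unsigned long total_ticks,
                                   unsigned long slice)
{
    if (slice == 0)
        return 1;   /* one unbounded sleep */
    return (total_ticks + slice - 1) / slice;
}
```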

    Signed-off-by: Xianting Tian
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Ming Lei
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1596461807-21087-1-git-send-email-xianting_tian@126.com
    Signed-off-by: Linus Torvalds

    Xianting Tian
     

29 Jun, 2020

1 commit


10 Jun, 2020

1 commit

  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
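
    As a worked example of the index arithmetic, here is a user-space
    sketch. The PMD_SHIFT and PTRS_PER_PMD values are assumptions (a layout
    with 4 KiB pages and 2 MiB PMD entries), and pmd_offset_model() takes
    the table base directly rather than deriving it from a pud_t.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define PMD_SHIFT    21
#define PTRS_PER_PMD 512

typedef struct { uint64_t val; } pmd_t;

/* Index of the PMD entry that covers the given address. */
static inline uint64_t pmd_index(uint64_t address)
{
    return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

/* Simplified offset helper: the table base is passed in directly. */
static inline pmd_t *pmd_offset_model(pmd_t *table, uint64_t address)
{
    return table + pmd_index(address);
}
```

    With these values, address 0x200000 selects entry 1, and the index
    wraps back to 0 at 0x40000000, where the next table would begin.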

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdef magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <linux/mm.h>") ; do
        sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

03 Feb, 2020

1 commit

  • Currently, bmap() returns either the physical block number related to
    the requested file offset, or 0 in case of error or if the requested
    offset maps into a hole.
    This patch makes the needed changes to enable bmap() to properly return
    errors: the return value becomes an error return, and a pointer must now
    be passed to bmap() to be filled with the mapped physical block.

    It will change the behavior of bmap() on return:

    - negative value in case of error
    - zero on success or map fell into a hole

    In case of a hole, the *block will be zero too

    Since this is a prep patch, for now the only error return is -EINVAL if
    ->bmap doesn't exist.
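
    The new calling convention can be sketched with a small user-space
    model. struct inode_model, bmap_model() and toy_bmap() are hypothetical
    names, and using the pointer as both input and output is an assumption
    of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_model_t;

struct inode_model {
    /* Optional per-filesystem mapping op; returns 0 for a hole. */
    sector_model_t (*bmap_op)(sector_model_t file_block);
};

/*
 * On entry *block holds the file block to map. On success it is
 * overwritten with the physical block (0 for a hole) and 0 is returned.
 * A negative errno is returned on error.
 */
static int bmap_model(const struct inode_model *inode, sector_model_t *block)
{
    if (!inode->bmap_op)
        return -EINVAL;
    *block = inode->bmap_op(*block);
    return 0;
}

/* Toy mapping: file block 7 is a hole, everything else is offset by 1000. */
static sector_model_t toy_bmap(sector_model_t file_block)
{
    return file_block == 7 ? 0 : file_block + 1000;
}
```

    A caller now checks the return value for errors first, and only then
    treats a zero *block as a hole.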

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Al Viro

    Carlos Maiolino
     

02 Dec, 2019

1 commit

  • If a block device supports the rw_page operation, it doesn't submit
    bios, so the annotation in submit_bio() for refault stalls doesn't work.
    It happens with zram on Android, especially in the swap read path, which
    can consume CPU cycles for decompression. It is also a problem for
    zswap, which uses frontswap.

    Annotate swap_readpage() to account for the synchronous IO overhead and
    to avoid underreporting memory pressure.

    [akpm@linux-foundation.org: add comment, per Johannes]
    Link: http://lkml.kernel.org/r/20191010152134.38545-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Seth Jennings
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

16 Nov, 2019

1 commit

  • The following race is observed, due to which a process faulting on a
    swap entry finds the page neither in swapcache nor in swap. This causes
    zram to give a zero filled page that gets mapped to the process,
    resulting in a user space crash later.

    Consider parent and child processes Pa and Pb sharing the same swap slot
    with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
    Virtual address 'VA' of Pa and Pb points to the shared swap entry.

    Pa                                       Pb

    fault on VA                              fault on VA
    do_swap_page                             do_swap_page
    lookup_swap_cache fails                  lookup_swap_cache fails
                                             Pb scheduled out
    swapin_readahead (deletes zram entry)
    swap_free (makes swap_count 1)
                                             Pb scheduled in
                                             swap_readpage (swap_count == 1)
                                             Takes SWP_SYNCHRONOUS_IO path
                                             zram entry absent
                                             zram gives a zero filled page

    Fix this by making sure that swap slot is freed only when swap count
    drops down to one.

    Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
    Fixes: aa8d22a11da9 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
    Signed-off-by: Vinayak Menon
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     

13 Jul, 2019

1 commit

  • swap_extent is used to map swap page offset to backing device's block
    offset. For a continuous block range, one swap_extent is used and all
    these swap_extents are managed in a linked list.

    These swap_extents are used by map_swap_entry() during swap's read and
    write path. To find out the backing device's block offset for a page
    offset, the swap_extent list will be traversed linearly, with
    curr_swap_extent being used as a cache to speed up the search.

    This works well as long as swap_extents are not huge or when the number
    of processes that access swap device are few, but when the swap device
    has many extents and there are a number of processes accessing the swap
    device concurrently, it can be a problem. On one of our servers, the
    disk's remaining size is tight:

    $df -h
    Filesystem Size Used Avail Use% Mounted on
    ... ...
    /dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4

    When creating an 80G swapfile there, there are as many as 84656 swap
    extents. The end result is, kernel spends about 30% time in
    map_swap_entry() and swap throughput is only 70MB/s.

    As a comparison, when I used smaller sized swapfile, like 4G whose
    swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and
    map_swap_entry() is about 3%.

    One downside of using rbtree for swap_extent is, 'struct rb_node' takes
    24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more
    for each swap_extent. For a swapfile that has 80k swap_extents, that
    means 625KiB more memory consumed.
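
    The trade-off can be sketched in user space by comparing a linear walk
    with an ordered lookup. Note the swap: this sketch uses binary search
    over a sorted array as a stand-in for the rbtree, and struct
    extent_model, lookup_linear(), lookup_ordered() and extra_bytes() are
    hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct extent_model {
    unsigned long start_page;   /* first swap page offset covered */
    unsigned long nr_pages;     /* length of the contiguous run */
    unsigned long start_block;  /* backing-device block of start_page */
};

/* Linear walk, as with a plain list: O(n) per lookup. */
static const struct extent_model *
lookup_linear(const struct extent_model *ext, size_t n, unsigned long page)
{
    for (size_t i = 0; i < n; i++)
        if (page >= ext[i].start_page &&
            page < ext[i].start_page + ext[i].nr_pages)
            return &ext[i];
    return NULL;
}

/* Ordered search: O(log n) per lookup; ext[] must be sorted by start_page. */
static const struct extent_model *
lookup_ordered(const struct extent_model *ext, size_t n, unsigned long page)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (page < ext[mid].start_page)
            hi = mid;
        else if (page >= ext[mid].start_page + ext[mid].nr_pages)
            lo = mid + 1;
        else
            return &ext[mid];
    }
    return NULL;
}

/* Extra memory in bytes if each node grows by per_node_extra bytes. */
static unsigned long extra_bytes(unsigned long nr_extents,
                                 unsigned long per_node_extra)
{
    return nr_extents * per_node_extra;
}
```

    For 80000 extents and 8 extra bytes per node, extra_bytes() gives
    640000 bytes, which is the 625KiB quoted above.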

    Test:

    Since it's not possible to reboot that server, I cannot test this patch
    directly there. Instead, I tested it on another server with NVMe disk.

    I created a 20G swapfile on an NVMe backed XFS fs. By default, the
    filesystem is quite clean and the created swapfile has only 2 extents.
    Testing vanilla and this patch shows no obvious performance difference
    when swapfile is not fragmented.

    To see the patch's effects, I used some tweaks to manually fragment the
    swapfile by breaking the extent at 1M boundary. This made the swapfile
    have 20K extents.

    nr_task=4
    kernel   swapout(KB/s)  map_swap_entry(perf)  swapin(KB/s)   map_swap_entry(perf)
    vanilla  165191         90.77%                171798         90.21%
    patched  858993 +420%    2.16%                715827 +317%    0.77%

    nr_task=8
    kernel   swapout(KB/s)  map_swap_entry(perf)  swapin(KB/s)   map_swap_entry(perf)
    vanilla  306783         92.19%                318145         87.76%
    patched  954437 +211%    2.35%                1073741 +237%   1.57%

    swapout: the throughput of swap out, in KB/s, higher is better
    1st map_swap_entry: cpu cycles percent sampled by perf
    swapin: the throughput of swap in, in KB/s, higher is better
    2nd map_swap_entry: cpu cycles percent sampled by perf

    nr_task=1 doesn't show any difference because curr_swap_extent can
    effectively cache the correct swap extent for a single-task workload.

    [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
    Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
    Signed-off-by: Aaron Lu
    Cc: Huang Ying
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

05 Jul, 2019

1 commit

  • swap_readpage() sets waiter = bio->bi_private even if synchronous = F,
    which means that the caller can get a spurious wakeup after return.

    This can be fatal if blk_wake_io_task() does
    set_current_state(TASK_RUNNING) after the caller does
    set_special_state(), in the worst case the kernel can crash in
    do_task_dead().

    Link: http://lkml.kernel.org/r/20190704160301.GA5956@redhat.com
    Fixes: 0619317ff8baa2d ("block: add polled wakeup task helper")
    Signed-off-by: Oleg Nesterov
    Reported-by: Qian Cai
    Acked-by: Hugh Dickins
    Reviewed-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Jun, 2019

1 commit

  • 0-Day test system reported some OOM regressions for several THP
    (Transparent Huge Page) swap test cases. These regressions are bisected
    to 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"). In the
    commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the
    bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out
    THP. That causes the OOM.

    As in the patch description of 6861428921b5 ("block: always define
    BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP
    to swap space. So the issue is fixed via doing that in get_swap_bio().

    BTW: I remember that I checked the THP swap code when 6861428921b5
    ("block: always define BIO_MAX_PAGES as 256") was merged, and thought the
    THP swap code didn't need to be changed. But apparently, I was wrong. I
    should have done this at that time.

    Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
    Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Ming Lei
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Daniel Jordan
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

05 Jan, 2019

1 commit

  • swap_readpage() wants to do polling to bring in pages if asked to, but
    it doesn't mark the bio as being polled. Additionally, the looping
    around the blk_poll() check isn't correct - if we get a zero return, we
    should call io_schedule(), we can't just assume that the bio has
    completed. The regular bio->bi_private check should be used for that.

    Link: http://lkml.kernel.org/r/e15243a8-2cdf-c32c-ecee-f289377c8ef9@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Andrew Morton
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

03 Jan, 2019

1 commit

  • This mostly reverts commit 849a370016a5 ("block: avoid ordered task
    state change for polled IO"). It was wrongly claiming that the ordering
    wasn't necessary. The memory barrier _is_ necessary.

    If something is truly polling and not going to sleep, it's the whole
    state setting that is unnecessary, not the memory barrier. Whenever you
    set your state to a sleeping state, you absolutely need the memory
    barrier.

    Note that sometimes the memory barrier can be elsewhere. For example,
    the ordering might be provided by an external lock, or by setting the
    process state to sleeping before adding yourself to the wait queue list
    that is used for waking up (where the wait queue lock itself will
    guarantee that any wakeup will correctly see the sleeping state).

    But none of those cases were true here.

    NOTE! Some of the polling paths may indeed be able to drop the state
    setting entirely, at which point the memory barrier also goes away.

    (Also note that this doesn't revert the TASK_RUNNING cases: there is no
    race between a wakeup and setting the process state to TASK_RUNNING,
    since the end result doesn't depend on ordering).

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Dec, 2018

1 commit

  • A prior patch in this series added blkg association to bios issued by
    cgroups. There are two other paths that we want to attribute work back
    to the appropriate cgroup: swap and writeback. Here we modify the way
    swap tags bios to include the blkg. Writeback will be tackled in the
    next patch.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

26 Nov, 2018

1 commit

  • blk_poll() has always kept spinning until it found an IO. This is
    fine for SYNC polling, since we need to find one request we have
    pending, but in preparation for ASYNC polling it can be beneficial
    to just check if we have any entries available or not.

    Existing callers are converted to pass in 'spin == true', to retain
    the old behavior.
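
    The effect of the new argument can be sketched with a small user-space
    model. struct queue_model, poll_once() and poll_model() are
    hypothetical names; only the spin semantics follow the description
    above.

```c
#include <assert.h>
#include <stdbool.h>

struct queue_model {
    int polls_until_ready;  /* a completion appears after this many polls */
    int polls_done;
};

/* One pass over the completion queue; returns the number of completions found. */
static int poll_once(struct queue_model *q)
{
    q->polls_done++;
    return q->polls_done > q->polls_until_ready ? 1 : 0;
}

/*
 * spin == true:  keep polling until something completes (the old behavior).
 * spin == false: check once and report what was found, possibly nothing.
 */
static int poll_model(struct queue_model *q, bool spin)
{
    int found;

    do {
        found = poll_once(q);
    } while (!found && spin);
    return found;
}
```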

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Nov, 2018

1 commit

  • For the core poll helper, the task state setting doesn't need to imply
    any atomics, as it's the current task itself that is being modified and
    we're not going to sleep.

    For IRQ driven, the wakeup path has the necessary barriers, so we don't
    need to use the heavy-handed version of the task state setting.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Nov, 2018

1 commit

  • If we're polling for IO on a device that doesn't use interrupts, then the
    IO completion loop (and the wake of the task) is done by the submitting
    task itself. If that is the case, then we don't need to enter the
    wake_up_process() function; we can simply mark ourselves as TASK_RUNNING.
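
    The shortcut can be sketched as follows. This is a user-space model
    with hypothetical names (struct task_model, wake_up_model(),
    wake_io_task_model()); the wakeups counter only exists to show which
    path was taken.

```c
#include <assert.h>

enum task_state_model { TASK_RUNNING_MODEL, TASK_SLEEPING_MODEL };

struct task_model {
    enum task_state_model state;
    int wakeups;    /* counts trips through the full wake-up path */
};

/* Stand-in for the heavier cross-task wake-up. */
static void wake_up_model(struct task_model *t)
{
    t->wakeups++;
    t->state = TASK_RUNNING_MODEL;
}

/* If the completing context is the waiter itself, just flip the state. */
static void wake_io_task_model(struct task_model *waiter,
                               struct task_model *current_task)
{
    if (waiter == current_task)
        waiter->state = TASK_RUNNING_MODEL;
    else
        wake_up_model(waiter);
}
```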

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

2 commits

  • Pull AFS updates from Al Viro:
    "AFS series, with some iov_iter bits included"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    missing bits of "iov_iter: Separate type from direction and use accessor functions"
    afs: Probe multiple fileservers simultaneously
    afs: Fix callback handling
    afs: Eliminate the address pointer from the address list cursor
    afs: Allow dumping of server cursor on operation failure
    afs: Implement YFS support in the fs client
    afs: Expand data structure fields to support YFS
    afs: Get the target vnode in afs_rmdir() and get a callback on it
    afs: Calc callback expiry in op reply delivery
    afs: Fix FS.FetchStatus delivery from updating wrong vnode
    afs: Implement the YFS cache manager service
    afs: Remove callback details from afs_callback_break struct
    afs: Commit the status on a new file/dir/symlink
    afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
    afs: Don't invoke the server to read data beyond EOF
    afs: Add a couple of tracepoints to log I/O errors
    afs: Handle EIO from delivery function
    afs: Fix TTL on VL server and address lists
    afs: Implement VL server rotation
    afs: Improve FS server rotation error handling
    ...

    Linus Torvalds
     
  • This reverts a series committed earlier due to a null pointer exception
    bug report in [1]. It seems there are edge case interactions that I did
    not consider, and I will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

27 Oct, 2018

1 commit

  • The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
    through the filesystem, and to make swapoff() call ->swap_deactivate().
    For Btrfs, we want the latter but not the former, so split this flag into
    two. This makes us always call ->swap_deactivate() if ->swap_activate()
    succeeded, not just if it didn't add any swap extents itself.

    This also resolves the issue of the very misleading name of SWP_FILE,
    which is only used for swap files over NFS.
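
    The split can be sketched with two independent flag bits. The flag
    names and bit values below are assumptions made for illustration, as
    are the two helper functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names and bit values; only the split into two flags matters. */
#define SWP_ACTIVATED_MODEL (1U << 0)   /* ->swap_activate() succeeded */
#define SWP_FS_MODEL        (1U << 1)   /* page IO goes through the filesystem */

/* Should swap page reads and writes go through the filesystem? */
static bool io_via_filesystem(unsigned int flags)
{
    return (flags & SWP_FS_MODEL) != 0;
}

/* Should swapoff call ->swap_deactivate()? */
static bool needs_deactivate(unsigned int flags)
{
    return (flags & SWP_ACTIVATED_MODEL) != 0;
}
```

    A swap file over NFS would carry both bits, while a filesystem that
    only needs the deactivate callback would carry just the first.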

    Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: David Sterba
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     

24 Oct, 2018

1 commit

  • In the iov_iter struct, separate the iterator type from the iterator
    direction and use accessor functions to access them in most places.

    Convert a bunch of places to use switch-statements to access them rather
    than chains of bitwise-AND statements. This makes it easier to add further
    iterator types. Also, this can be more efficient: to implement a switch
    of small contiguous integers, the compiler can use ~50% fewer compare
    instructions than it has to use bitwise-and instructions.

    Further, cease passing the iterator type into the iterator setup function.
    The iterator function can set that itself. Only the direction is required.
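
    The separation can be sketched as follows. The enum values, struct
    layout and function names are hypothetical and chosen only to show
    accessor functions, a switch over the type, and a setup function that
    takes only the direction.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical type and direction values for illustration. */
enum iter_type_model { ITER_IOVEC_M, ITER_KVEC_M, ITER_BVEC_M, ITER_PIPE_M };
enum iter_dir_model { ITER_READ_M, ITER_WRITE_M };

struct iov_iter_model {
    enum iter_type_model type;
    enum iter_dir_model dir;
};

/* Accessors hide how type and direction are stored. */
static inline enum iter_type_model iter_type(const struct iov_iter_model *i)
{
    return i->type;
}

static inline enum iter_dir_model iter_dir(const struct iov_iter_model *i)
{
    return i->dir;
}

/* Setup takes only the direction; the constructor knows its own type. */
static void iter_kvec_init(struct iov_iter_model *i, enum iter_dir_model dir)
{
    i->type = ITER_KVEC_M;
    i->dir = dir;
}

/* A switch over small contiguous values instead of a chain of bit tests. */
static const char *iter_type_name(const struct iov_iter_model *i)
{
    switch (iter_type(i)) {
    case ITER_IOVEC_M: return "iovec";
    case ITER_KVEC_M:  return "kvec";
    case ITER_BVEC_M:  return "bvec";
    case ITER_PIPE_M:  return "pipe";
    }
    return "unknown";
}
```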

    Signed-off-by: David Howells

    David Howells
     

22 Sep, 2018

1 commit

  • A prior patch in this series added blkg association to bios issued by
    cgroups. There are two other paths that we want to attribute work back
    to the appropriate cgroup: swap and writeback. Here we modify the way
    swap tags bios to include the blkg. Writeback will be tackled in the
    next patch.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

09 Jul, 2018

2 commits

  • For backcharging we need to know who the page belongs to when swapping
    it out. We don't worry about things that do ->rw_page (zram etc) at the
    moment, we're only worried about pages that actually go to a block
    device.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Just like REQ_META, it's important to know the IO coming down is swap
    in order to guard against potential IO priority inversion issues with
    cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
    bio_issue_as_root_blkg helper.
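
    The check described here can be sketched as a single mask test. The
    bit positions, struct bio_model and issue_as_root_model() are
    hypothetical; only the idea of treating swap IO like metadata IO comes
    from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions for illustration. */
#define REQ_META_MODEL (1U << 0)
#define REQ_SWAP_MODEL (1U << 1)

struct bio_model { unsigned int opf; };

/* Metadata and swap IO are issued as the root group to avoid priority inversion. */
static bool issue_as_root_model(const struct bio_model *bio)
{
    return (bio->opf & (REQ_META_MODEL | REQ_SWAP_MODEL)) != 0;
}
```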

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

07 Jan, 2018

1 commit


16 Nov, 2017

1 commit

  • With fast swap storage, the platforms want to use swap more aggressively
    and swap-in is crucial to application latency.

    The rw_page() based synchronous devices like zram, pmem and btt are such
    fast storage. When I profile swapin performance with a zram lz4
    decompress test, S/W overhead is more than 70%. It may be even bigger
    on nvdimm.

    This patch aims to reduce swap-in latency by skipping swapcache if the
    swap device is a synchronous device such as an rw_page based device. It
    improves my swapin test by 45% (5G sequential swapin, no readahead, from
    2.41sec to 1.64sec).

    Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Sergey Senozhatsky
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"
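
    The {mq,name} tuple lookup for IO schedulers mentioned in the list above
    can be pictured with a small userspace model. Everything here (the
    struct, the table contents, the function name) is hypothetical and is
    not the kernel's elevator code; it only shows how keying on both the mq
    flag and the name lets 'deadline' resolve to a different scheduler on
    each path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative model only; the struct and table are hypothetical. */
struct sched_entry {
	const char *name;   /* canonical scheduler name */
	const char *alias;  /* optional alias, NULL if none */
	bool uses_mq;       /* true for blk-mq schedulers */
};

static const struct sched_entry sched_table[] = {
	{ "deadline",    NULL,       false },
	{ "mq-deadline", "deadline", true  },
	{ "kyber",       NULL,       true  },
};

/* Look up a scheduler by the (mq, name) pair; the name may match an alias. */
const struct sched_entry *sched_find(bool mq, const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(sched_table) / sizeof(sched_table[0]); i++) {
		const struct sched_entry *e = &sched_table[i];

		if (e->uses_mq != mq)
			continue;       /* wrong queue type, skip */
		if (strcmp(e->name, name) == 0)
			return e;
		if (e->alias != NULL && strcmp(e->alias, name) == 0)
			return e;
	}
	return NULL;
}
```

    In this model, sched_find(true, "deadline") returns the "mq-deadline"
    entry, while sched_find(false, "deadline") returns the plain "deadline"
    entry.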

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

04 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
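
    For illustration, a C source file tagged this way would typically carry
    the identifier as a comment on its first line, as in the sketch below.
    The has_spdx_tag() helper is a hypothetical checker added only to make
    the example self-contained; it is not part of any kernel tooling.

```c
// SPDX-License-Identifier: GPL-2.0
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: report whether a line of text contains an SPDX
 * license identifier tag. It does not validate the license expression.
 */
bool has_spdx_tag(const char *line)
{
	assert(line != NULL);
	return strstr(line, "SPDX-License-Identifier: ") != NULL;
}
```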

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

1 commit

  • To support delaying the split of a THP (Transparent Huge Page) until
    after it is swapped out, we need to enhance the swap writing code to
    support writing a THP as a whole. This will improve swap write IO
    performance.

    As Ming Lei pointed out, this should be based on multipage bvec support,
    which hasn't been merged yet. So this patch is only for testing the
    functionality of the other patches in the series, and will be
    reimplemented after multipage bvec support is merged.

    Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.
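
    The idea of carrying a disk pointer plus a partition index, and
    remapping before submission, can be sketched with a simplified userspace
    model. All names below (model_disk, model_bio, model_remap and so on)
    are hypothetical and are not the kernel's structures; the sketch only
    assumes that index 0 stands for the whole disk and that each partition
    has a start offset and a length in sectors.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PARTS 4

/* Simplified userspace model; all names are hypothetical. */
struct model_part {
	unsigned long long start_sect;  /* partition offset on the disk */
	unsigned long long nr_sects;    /* 0 means the slot is unused */
};

struct model_disk {
	struct model_part part[MODEL_MAX_PARTS];  /* index 0 = whole disk */
};

struct model_bio {
	struct model_disk *disk;    /* exists once per block device */
	unsigned int partno;        /* partition index used for remapping */
	unsigned long long sector;  /* partition-relative until remapped */
};

/*
 * Turn a partition-relative sector into a whole-disk sector.
 * Returns 0 on success, -1 if the partition or sector is invalid.
 */
int model_remap(struct model_bio *bio)
{
	const struct model_part *p;

	assert(bio != NULL && bio->disk != NULL);
	if (bio->partno == 0)
		return 0;               /* already addressed to the whole disk */
	if (bio->partno >= MODEL_MAX_PARTS)
		return -1;
	p = &bio->disk->part[bio->partno];
	if (p->nr_sects == 0 || bio->sector >= p->nr_sects)
		return -1;              /* unused slot or out of range */
	bio->sector += p->start_sect;
	bio->partno = 0;                /* mark as remapped */
	return 0;
}
```

    In this model no per-open device object is needed to submit I/O: the
    disk and the small integer index are enough to compute the final
    sector.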

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

03 Aug, 2017

1 commit

  • When a thread is OOM-killed during swap_readpage() operation, an oops
    occurs because end_swap_bio_read() is calling wake_up_process() based on
    an assumption that the thread which called swap_readpage() is still
    alive.

    Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child
    Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
    oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    task: ffffffffb7c16500 task.stack: ffffffffb7c00000
    RIP: 0010:__lock_acquire+0x151/0x12f0
    Call Trace:

    lock_acquire+0x59/0x80
    _raw_spin_lock_irqsave+0x3b/0x4f
    try_to_wake_up+0x3b/0x410
    wake_up_process+0x10/0x20
    end_swap_bio_read+0x6f/0xf0
    bio_endio+0x92/0xb0
    blk_update_request+0x88/0x270
    scsi_end_request+0x32/0x1c0
    scsi_io_completion+0x209/0x680
    scsi_finish_command+0xd4/0x120
    scsi_softirq_done+0x120/0x140
    __blk_mq_complete_request_remote+0xe/0x10
    flush_smp_call_function_queue+0x51/0x120
    generic_smp_call_function_single_interrupt+0xe/0x20
    smp_trace_call_function_single_interrupt+0x22/0x30
    smp_call_function_single_interrupt+0x9/0x10
    call_function_single_interrupt+0xa7/0xb0

    RIP: 0010:native_safe_halt+0x6/0x10
    default_idle+0xe/0x20
    arch_cpu_idle+0xa/0x10
    default_idle_call+0x1e/0x30
    do_idle+0x187/0x200
    cpu_startup_entry+0x6e/0x70
    rest_init+0xd0/0xe0
    start_kernel+0x456/0x477
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0xf7/0x11a
    secondary_startup_64+0xa5/0xa5
    Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85
    RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50
    ---[ end trace 6c441db499169b1e ]---
    Kernel panic - not syncing: Fatal exception in interrupt
    Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    ---[ end Kernel panic - not syncing: Fatal exception in interrupt

    Fix it by holding a reference to the thread.
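
    The pattern of holding a reference can be sketched in a single-threaded
    userspace model. This is not the actual patch: the model_* names are
    hypothetical, the counter is a plain int rather than an atomic, and in
    kernel terms the get/put pair would presumably correspond to
    get_task_struct() and put_task_struct().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace model; every name here is a hypothetical stand-in. */
struct model_task {
	int refcount;  /* plain int: the model is single-threaded */
	bool woken;
};

struct model_io {
	struct model_task *waiter;  /* task to wake when the read completes */
};

struct model_task *model_task_new(void)
{
	struct model_task *t = calloc(1, sizeof(*t));

	if (t != NULL)
		t->refcount = 1;  /* reference owned by the creator */
	return t;
}

struct model_task *model_task_get(struct model_task *t)
{
	assert(t != NULL && t->refcount > 0);
	t->refcount++;
	return t;
}

/* Drop one reference; returns true if the task was freed. */
bool model_task_put(struct model_task *t)
{
	assert(t != NULL && t->refcount > 0);
	if (--t->refcount == 0) {
		free(t);
		return true;
	}
	return false;
}

/* Submission pins the waiter so it outlives the in-flight I/O. */
void model_io_submit(struct model_io *io, struct model_task *t)
{
	io->waiter = model_task_get(t);
}

/* Completion wakes the waiter, then drops the reference taken above. */
void model_io_complete(struct model_io *io)
{
	struct model_task *t = io->waiter;

	io->waiter = NULL;
	t->woken = true;
	model_task_put(t);
}
```

    Because the submitter takes its own reference, the task memory stays
    valid even if the owner drops its reference before the completion
    handler runs.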

    [akpm@linux-foundation.org: add comment]
    Fixes: 23955622ff8d231b ("swap: add block io poll in swapin path")
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa