02 Dec, 2019

1 commit

  • If a block device supports the rw_page operation, it doesn't submit bios,
    so the annotation in submit_bio() for refault stalls doesn't work. This
    happens with zram on Android, especially on the swap read path, which can
    consume CPU cycles for decompression. It is also a problem for zswap,
    which uses frontswap.

    Annotate swap_readpage() to account for the synchronous IO overhead and
    prevent under-reporting of memory pressure.
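
    A minimal sketch of the idea (the psi_memstall_enter()/psi_memstall_leave()
    pair is the standard annotation; its exact placement inside swap_readpage()
    here is illustrative):

        int swap_readpage(struct page *page, bool synchronous)
        {
                unsigned long pflags;
                int ret = 0;

                /*
                 * The rw_page/SWP_SYNCHRONOUS_IO paths never reach
                 * submit_bio(), so count this in-context read as a memory
                 * stall explicitly.
                 */
                psi_memstall_enter(&pflags);

                /* ... rw_page / frontswap / bio submission happens here ... */

                psi_memstall_leave(&pflags);
                return ret;
        }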

    [akpm@linux-foundation.org: add comment, per Johannes]
    Link: http://lkml.kernel.org/r/20191010152134.38545-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Seth Jennings
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

16 Nov, 2019

1 commit

  • The following race is observed, due to which a process faulting on a
    swap entry finds the page neither in the swapcache nor in swap. This
    causes zram to return a zero-filled page that gets mapped to the process,
    resulting in a user space crash later.

    Consider parent and child processes Pa and Pb sharing the same swap slot
    with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
    Virtual address 'VA' of Pa and Pb points to the shared swap entry.

    Pa                                      Pb

    fault on VA                             fault on VA
    do_swap_page                            do_swap_page
    lookup_swap_cache fails                 lookup_swap_cache fails
                                            Pb scheduled out
    swapin_readahead (deletes zram entry)
    swap_free (makes swap_count 1)
                                            Pb scheduled in
                                            swap_readpage (swap_count == 1)
                                            Takes SWP_SYNCHRONOUS_IO path
                                            zram entry absent
                                            zram gives a zero filled page

    Fix this by making sure that the swap slot is freed only when the swap
    count drops to one.
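
    The shape of the fix, as a heavily simplified sketch (assuming the
    free-notify path can consult __swap_count(); the real call site and locking
    are more involved):

        /*
         * Only tell a SWP_SYNCHRONOUS_IO device (e.g. zram) to discard its
         * copy of the slot when no other reference to the swap entry remains;
         * otherwise a second process sharing the entry would later read back
         * a zero-filled page.
         */
        if (__swap_count(entry) == 1)
                swap_slot_free_notify(page);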

    Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
    Fixes: aa8d22a11da9 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
    Signed-off-by: Vinayak Menon
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     

13 Jul, 2019

1 commit

  • swap_extent is used to map a swap page offset to the backing device's
    block offset. For a contiguous block range, one swap_extent is used, and
    all these swap_extents are managed in a linked list.

    These swap_extents are used by map_swap_entry() during swap's read and
    write paths. To find the backing device's block offset for a page
    offset, the swap_extent list is traversed linearly, with
    curr_swap_extent being used as a cache to speed up the search.

    This works well as long as there aren't a huge number of swap_extents, or
    when only a few processes access the swap device, but when the swap device
    has many extents and a number of processes access the swap device
    concurrently, it can be a problem. On one of our servers, the disk's
    remaining space is tight:

    $ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    ... ...
    /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4

    When creating an 80G swapfile there, there are as many as 84656 swap
    extents. The end result is that the kernel spends about 30% of its time in
    map_swap_entry() and swap throughput is only 70MB/s.

    As a comparison, when I used a smaller swapfile, like 4G, the number of
    swap_extents dropped to about 2000, swap throughput went back to
    400-500MB/s and map_swap_entry() accounted for about 3%.

    This patch therefore replaces the swap_extent linked list with an rbtree,
    so a lookup becomes O(log n) instead of a linear scan. One downside of
    using an rbtree for swap_extent is that 'struct rb_node' takes 24 bytes
    while 'struct list_head' takes 16 bytes, i.e. 8 bytes more for each
    swap_extent. For a swapfile that has 80k swap_extents, that means 625KiB
    more memory consumed.
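
    The lookup then becomes a standard rbtree walk keyed by page offset instead
    of a list scan; a sketch, assuming each swap_extent keeps its
    start_page/nr_pages fields and the root lives in the swap_info_struct:

        struct swap_extent *
        offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)
        {
                struct rb_node *rb = sis->swap_extent_root.rb_node;
                struct swap_extent *se;

                while (rb) {
                        se = rb_entry(rb, struct swap_extent, rb_node);
                        if (offset < se->start_page)
                                rb = rb->rb_left;
                        else if (offset >= se->start_page + se->nr_pages)
                                rb = rb->rb_right;
                        else
                                return se;   /* offset falls inside this extent */
                }
                BUG();   /* every valid swap offset must be covered by an extent */
        }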

    Test:

    Since it's not possible to reboot that server, I cannot test this patch
    directly there. Instead, I tested it on another server with an NVMe disk.

    I created a 20G swapfile on an NVMe backed XFS fs. By default, the
    filesystem is quite clean and the created swapfile has only 2 extents.
    Testing vanilla and this patch shows no obvious performance difference
    when swapfile is not fragmented.

    To see the patch's effects, I used some tweaks to manually fragment the
    swapfile by breaking the extent at 1M boundary. This made the swapfile
    have 20K extents.

    nr_task=4
    kernel    swapout(KB/s)    map_swap_entry(perf)    swapin(KB/s)     map_swap_entry(perf)
    vanilla   165191           90.77%                  171798           90.21%
    patched   858993 +420%     2.16%                   715827 +317%     0.77%

    nr_task=8
    kernel    swapout(KB/s)    map_swap_entry(perf)    swapin(KB/s)     map_swap_entry(perf)
    vanilla   306783           92.19%                  318145           87.76%
    patched   954437 +211%     2.35%                   1073741 +237%    1.57%

    swapout: swap-out throughput in KB/s, higher is better
    1st map_swap_entry: CPU cycles percentage in map_swap_entry() sampled by perf (swap-out run)
    swapin: swap-in throughput in KB/s, higher is better
    2nd map_swap_entry: CPU cycles percentage in map_swap_entry() sampled by perf (swap-in run)

    nr_task=1 doesn't show any difference; this is because curr_swap_extent
    can effectively cache the correct swap extent for a single-task workload.

    [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
    Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
    Signed-off-by: Aaron Lu
    Cc: Huang Ying
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

05 Jul, 2019

1 commit

  • swap_readpage() sets waiter = bio->bi_private even if synchronous == false,
    which means that the caller can get a spurious wakeup after return.

    This can be fatal if blk_wake_io_task() does
    set_current_state(TASK_RUNNING) after the caller does
    set_special_state(), in the worst case the kernel can crash in
    do_task_dead().
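
    The shape of the fix, as a sketch: only publish the waiter when the caller
    will actually wait, so the completion path has nothing to wake otherwise.

        /* swap_readpage(): */
        if (synchronous)
                bio->bi_private = current;      /* otherwise left NULL */

        /* end_swap_bio_read(): */
        struct task_struct *waiter = bio->bi_private;

        WRITE_ONCE(bio->bi_private, NULL);
        bio_put(bio);
        if (waiter)                             /* set only for synchronous reads */
                blk_wake_io_task(waiter);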

    Link: http://lkml.kernel.org/r/20190704160301.GA5956@redhat.com
    Fixes: 0619317ff8baa2d ("block: add polled wakeup task helper")
    Signed-off-by: Oleg Nesterov
    Reported-by: Qian Cai
    Acked-by: Hugh Dickins
    Reviewed-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Jun, 2019

1 commit

  • The 0-Day test system reported some OOM regressions for several THP
    (Transparent Huge Page) swap test cases. These regressions were bisected
    to 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"). In that
    commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled, so the
    bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out a
    THP. That causes the OOM.

    As in the patch description of 6861428921b5 ("block: always define
    BIO_MAX_PAGES as 256"), THP swap should use a multi-page bvec to write a
    THP to swap space. So the issue is fixed by doing that in get_swap_bio().
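
    A sketch of the fixed allocation in get_swap_bio() (hpage_nr_pages() is
    used here for the THP page count; error handling trimmed): one multi-page
    bvec now covers the whole THP instead of needing 512 single-page segments.

        struct bio *bio = bio_alloc(gfp_flags, 1);      /* one multi-page bvec */

        if (bio) {
                bio->bi_iter.bi_sector = map_swap_page(page, &bdev);
                bio_set_dev(bio, bdev);
                bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
                bio->bi_end_io = end_io;
                /* add the whole (possibly huge) page as a single segment */
                bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
        }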

    BTW: I remember checking the THP swap code when 6861428921b5 ("block:
    always define BIO_MAX_PAGES as 256") was merged, and thought the THP swap
    code needn't be changed. But apparently, I was wrong. I should have done
    this at that time.

    Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
    Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Ming Lei
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Daniel Jordan
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

05 Jan, 2019

1 commit

  • swap_readpage() wants to do polling to bring in pages if asked to, but
    it doesn't mark the bio as being polled. Additionally, the looping
    around the blk_poll() check isn't correct - if we get a zero return, we
    should call io_schedule(), we can't just assume that the bio has
    completed. The regular bio->bi_private check should be used for that.
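
    A sketch of the corrected path (REQ_HIPRI marks the bio as polled in that
    era's API; bio->bi_private doubling as the completion signal is as
    described above):

        if (synchronous)
                bio->bi_opf |= REQ_HIPRI;       /* actually mark the bio as polled */

        qc = submit_bio(bio);
        while (synchronous) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (!READ_ONCE(bio->bi_private))
                        break;                  /* completion cleared it: IO is done */
                if (!blk_poll(disk->queue, qc, true))
                        io_schedule();          /* found nothing: sleep, don't assume done */
        }
        __set_current_state(TASK_RUNNING);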

    Link: http://lkml.kernel.org/r/e15243a8-2cdf-c32c-ecee-f289377c8ef9@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Andrew Morton
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

03 Jan, 2019

1 commit

  • This mostly reverts commit 849a370016a5 ("block: avoid ordered task
    state change for polled IO"). It was wrongly claiming that the ordering
    wasn't necessary. The memory barrier _is_ necessary.

    If something is truly polling and not going to sleep, it's the whole
    state setting that is unnecessary, not the memory barrier. Whenever you
    set your state to a sleeping state, you absolutely need the memory
    barrier.

    Note that sometimes the memory barrier can be elsewhere. For example,
    the ordering might be provided by an external lock, or by setting the
    process state to sleeping before adding yourself to the wait queue list
    that is used for waking up (where the wait queue lock itself will
    guarantee that any wakeup will correctly see the sleeping state).

    But none of those cases were true here.

    NOTE! Some of the polling paths may indeed be able to drop the state
    setting entirely, at which point the memory barrier also goes away.

    (Also note that this doesn't revert the TASK_RUNNING cases: there is no
    race between a wakeup and setting the process state to TASK_RUNNING,
    since the end result doesn't depend on ordering).
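
    For reference, the pattern being defended is the classic one (a generic
    sketch, not code lifted from this revert):

        /* sleeper */
        set_current_state(TASK_UNINTERRUPTIBLE);   /* state write + full barrier */
        if (!READ_ONCE(done))                      /* condition checked after the barrier */
                io_schedule();
        __set_current_state(TASK_RUNNING);

        /* waker */
        WRITE_ONCE(done, 1);
        wake_up_process(task);   /* either sees the sleeping state, or the sleeper
                                    sees 'done' - never neither */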

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Dec, 2018

1 commit

  • A prior patch in this series added blkg association to bios issued by
    cgroups. There are two other paths that we want to attribute work back
    to the appropriate cgroup: swap and writeback. Here we modify the way
    swap tags bios to include the blkg. Writeback will be tackled in the next
    patch.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

26 Nov, 2018

1 commit

  • blk_poll() has always kept spinning until it found an IO. This is
    fine for SYNC polling, since we need to find one request we have
    pending, but in preparation for ASYNC polling it can be beneficial
    to just check if we have any entries available or not.

    Existing callers are converted to pass in 'spin == true', to retain
    the old behavior.
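
    A sketch of a converted call site (the third argument is the new 'spin'
    flag; 'q', 'qc' and the surrounding wait loop are assumed context):

        /* spin == true keeps the old behaviour: busy-poll until something completes */
        if (!blk_poll(q, qc, true))
                io_schedule();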

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Nov, 2018

1 commit

  • For the core poll helper, the task state setting doesn't need to imply any
    atomics, as it's the current task itself that is being modified and
    we're not going to sleep.

    For IRQ-driven IO, the wakeup path has the necessary barriers, so we don't
    need to use the heavy-handed version of the task state setting.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Nov, 2018

1 commit

  • If we're polling for IO on a device that doesn't use interrupts, then the
    IO completion loop (and wakeup of the task) is done by the submitting task
    itself. If that is the case, then we don't need to enter the
    wake_up_process() function, we can simply mark ourselves as TASK_RUNNING.
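
    The resulting helper looks roughly like this (sketch):

        static inline void blk_wake_io_task(struct task_struct *waiter)
        {
                /*
                 * If we're polling, the submitting task runs the completion
                 * itself: no wakeup needed, just flip the state.
                 */
                if (waiter == current)
                        __set_current_state(TASK_RUNNING);
                else
                        wake_up_process(waiter);
        }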

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

2 commits

  • Pull AFS updates from Al Viro:
    "AFS series, with some iov_iter bits included"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    missing bits of "iov_iter: Separate type from direction and use accessor functions"
    afs: Probe multiple fileservers simultaneously
    afs: Fix callback handling
    afs: Eliminate the address pointer from the address list cursor
    afs: Allow dumping of server cursor on operation failure
    afs: Implement YFS support in the fs client
    afs: Expand data structure fields to support YFS
    afs: Get the target vnode in afs_rmdir() and get a callback on it
    afs: Calc callback expiry in op reply delivery
    afs: Fix FS.FetchStatus delivery from updating wrong vnode
    afs: Implement the YFS cache manager service
    afs: Remove callback details from afs_callback_break struct
    afs: Commit the status on a new file/dir/symlink
    afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
    afs: Don't invoke the server to read data beyond EOF
    afs: Add a couple of tracepoints to log I/O errors
    afs: Handle EIO from delivery function
    afs: Fix TTL on VL server and address lists
    afs: Implement VL server rotation
    afs: Improve FS server rotation error handling
    ...

    Linus Torvalds
     
  • This reverts a series committed earlier due to a null pointer exception
    bug report in [1]. It seems there are edge case interactions that I did
    not consider, and it will take some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

27 Oct, 2018

1 commit

  • The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
    through the filesystem, and to make swapoff() call ->swap_deactivate().
    For Btrfs, we want the latter but not the former, so split this flag into
    two. This makes us always call ->swap_deactivate() if ->swap_activate()
    succeeded, not just if it didn't add any swap extents itself.

    This also resolves the issue of the very misleading name of SWP_FILE,
    which is only used for swap files over NFS.
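
    A sketch of the resulting activation logic (flag names SWP_ACTIVATED and
    SWP_FS as introduced by the split; error handling trimmed):

        if (mapping->a_ops->swap_activate) {
                ret = mapping->a_ops->swap_activate(sis, swap_file, span);
                if (ret >= 0)
                        sis->flags |= SWP_ACTIVATED;  /* swapoff must call ->swap_deactivate() */
                if (!ret) {
                        sis->flags |= SWP_FS;         /* IO goes through the filesystem (NFS) */
                        ret = add_swap_extent(sis, 0, sis->max, 0);
                }
                return ret;
        }
        return generic_swapfile_activate(sis, swap_file, span);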

    Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: David Sterba
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     

24 Oct, 2018

1 commit

  • In the iov_iter struct, separate the iterator type from the iterator
    direction and use accessor functions to access them in most places.

    Convert a bunch of places to use switch statements to access them rather
    than chains of bitwise-AND statements. This makes it easier to add further
    iterator types. Also, this can be more efficient: to implement a switch
    over small contiguous integers, the compiler can use ~50% fewer compare
    instructions than it would need bitwise-AND instructions.

    Further, cease passing the iterator type into the iterator setup function.
    The iterator function can set that itself. Only the direction is required.
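
    The pattern after the change, roughly (direction-only setup plus accessors
    instead of open-coded bit tests; 'page' is assumed context):

        struct bio_vec bv = {
                .bv_page   = page,
                .bv_len    = PAGE_SIZE,
                .bv_offset = 0,
        };
        struct iov_iter from;

        iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);  /* direction only */

        switch (iov_iter_type(&from)) {   /* accessor instead of masking i->type */
        case ITER_BVEC:
                /* bvec-backed iterator */
                break;
        default:
                break;
        }

        if (iov_iter_rw(&from) == WRITE) {
                /* direction accessor replaces 'i->type & WRITE' tests */
        }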

    Signed-off-by: David Howells

    David Howells
     

22 Sep, 2018

1 commit

  • A prior patch in this series added blkg association to bios issued by
    cgroups. There are two other paths that we want to attribute work back
    to the appropriate cgroup: swap and writeback. Here we modify the way
    swap tags bios to include the blkg. Writeback will be tackled in the next
    patch.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

09 Jul, 2018

2 commits

  • For backcharging we need to know who the page belongs to when swapping
    it out. We don't worry about things that do ->rw_page (zram etc) at the
    moment, we're only worried about pages that actually go to a block
    device.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Just like REQ_META, it's important to know the IO coming down is swap
    in order to guard against potential IO priority inversion issues with
    cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
    bio_issue_as_root_blkg helper.
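
    In the swap write-out path this amounts to something like the following
    sketch (the read side is tagged the same way):

        /* tag all swap IO so blk-cgroup and the priority logic can spot it */
        bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SWAP | wbc_to_write_flags(wbc));
        submit_bio(bio);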

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

07 Jan, 2018

1 commit


16 Nov, 2017

1 commit

  • With fast swap storage, the platforms want to use swap more aggressively
    and swap-in is crucial to application latency.

    The rw_page()-based synchronous devices like zram, pmem and btt are such
    fast storage. When I profile swap-in performance with a zram lz4
    decompression test, the software overhead is more than 70%. Maybe it
    would be bigger on nvdimm.

    This patch aims to reduce swap-in latency by skipping the swapcache if the
    swap device is a synchronous device, like an rw_page-based device. It
    improves my swap-in test (5G sequential swap-in, no readahead) by 45%,
    from 2.41sec to 1.64sec.
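
    A sketch of the resulting fast path in do_swap_page() (charging, error
    handling and the reference-count guard that later commits add on top are
    omitted; names are approximate for that era's mm code):

        struct swap_info_struct *si = swp_swap_info(entry);

        if (si->flags & SWP_SYNCHRONOUS_IO) {
                /* skip the swapcache: allocate a page and read it in directly */
                page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
                if (page) {
                        __SetPageLocked(page);
                        __SetPageSwapBacked(page);
                        set_page_private(page, entry.val);
                        swap_readpage(page, true);      /* synchronous read */
                }
        } else {
                page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma,
                                        vmf->address);
        }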

    Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Sergey Senozhatsky
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

04 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

1 commit

  • To support delaying the splitting of a THP (Transparent Huge Page) until
    after it has been swapped out, we need to enhance the swap writing code to
    support writing a THP as a whole. This will improve swap write IO
    performance.

    As Ming Lei pointed out, this should be based on multipage bvec support,
    which hasn't been merged yet. So this patch is only for testing the
    functionality of the other patches in the series, and it will be
    reimplemented after multipage bvec support is merged.

    Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.
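
    Callers switch from assigning bi_bdev to bio_set_dev(), which records the
    gendisk and partition index; a sketch:

        /* before: bio->bi_bdev = bdev; */
        bio_set_dev(bio, bdev);            /* fills in bio->bi_disk and bio->bi_partno */
        bio->bi_iter.bi_sector = sector;   /* partition remapping uses bi_partno later */
        submit_bio(bio);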

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

03 Aug, 2017

1 commit

  • When a thread is OOM-killed during swap_readpage() operation, an oops
    occurs because end_swap_bio_read() is calling wake_up_process() based on
    an assumption that the thread which called swap_readpage() is still
    alive.

    Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child
    Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
    oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    task: ffffffffb7c16500 task.stack: ffffffffb7c00000
    RIP: 0010:__lock_acquire+0x151/0x12f0
    Call Trace:

    lock_acquire+0x59/0x80
    _raw_spin_lock_irqsave+0x3b/0x4f
    try_to_wake_up+0x3b/0x410
    wake_up_process+0x10/0x20
    end_swap_bio_read+0x6f/0xf0
    bio_endio+0x92/0xb0
    blk_update_request+0x88/0x270
    scsi_end_request+0x32/0x1c0
    scsi_io_completion+0x209/0x680
    scsi_finish_command+0xd4/0x120
    scsi_softirq_done+0x120/0x140
    __blk_mq_complete_request_remote+0xe/0x10
    flush_smp_call_function_queue+0x51/0x120
    generic_smp_call_function_single_interrupt+0xe/0x20
    smp_trace_call_function_single_interrupt+0x22/0x30
    smp_call_function_single_interrupt+0x9/0x10
    call_function_single_interrupt+0xa7/0xb0

    RIP: 0010:native_safe_halt+0x6/0x10
    default_idle+0xe/0x20
    arch_cpu_idle+0xa/0x10
    default_idle_call+0x1e/0x30
    do_idle+0x187/0x200
    cpu_startup_entry+0x6e/0x70
    rest_init+0xd0/0xe0
    start_kernel+0x456/0x477
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0xf7/0x11a
    secondary_startup_64+0xa5/0xa5
    Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85
    RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50
    ---[ end trace 6c441db499169b1e ]---
    Kernel panic - not syncing: Fatal exception in interrupt
    Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    ---[ end Kernel panic - not syncing: Fatal exception in interrupt

    Fix it by holding a reference to the thread.
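
    The shape of the fix, as a sketch: pin the task before handing it to the
    bio, and drop the reference in the completion handler.

        /* swap_readpage(), synchronous case */
        get_task_struct(current);       /* keep the task alive for end_swap_bio_read() */
        bio->bi_private = current;

        /* end_swap_bio_read() */
        wake_up_process(waiter);
        put_task_struct(waiter);        /* safe even if the waiter was OOM-killed meanwhile */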

    [akpm@linux-foundation.org: add comment]
    Fixes: 23955622ff8d231b ("swap: add block io poll in swapin path")
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

11 Jul, 2017

1 commit

  • For fast flash disk, async IO could introduce overhead because of
    context switch. block-mq now supports IO poll, which improves
    performance and latency a lot. swapin is a good place to use this
    technique, because the task is waiting for the swapin page to continue
    execution.

    In my virtual machine, directly reading 4k data from an NVMe device with
    iopoll is about 60% faster than without poll. With iopoll support in the
    swapin path, my microbenchmark (a task doing random memory writes) is
    about 10%~25% faster. CPU utilization increases a lot though, 2x and even
    3x. This will depend on disk speed.

    While iopoll in swapin isn't intended for all use cases, it's a win for
    latency-sensitive workloads with a high speed swap disk. The block layer
    has a knob to control polling at runtime. If poll isn't enabled in the
    block layer, there should be no noticeable change in swapin.

    I got a chance to run the same test in a NVMe with DRAM as the media.
    In simple fio IO test, blkpoll boosts 50% performance in single thread
    test and ~20% in 8 threads test. So this is the base line. In above
    swap test, blkpoll boosts ~27% performance in single thread test.
    blkpoll uses 2x CPU time though.

    If we enable hybrid polling, the performance gain drops very slightly but
    CPU time is only 50% worse than without blkpoll. We can also adjust the
    hybrid poll parameters; with them, the CPU time penalty is reduced
    further. In the 8-thread test, blkpoll doesn't help though. The
    performance is similar to that without blkpoll, and CPU utilization is
    similar too; there is lock contention in the swap path and the CPU time
    spent on blkpoll isn't high. So overall, blkpoll swapin isn't worse than
    that without it.

    Swapin readahead might read in several pages at the same time and form a
    big IO request. Since that IO will take a longer time, it doesn't make
    sense to poll for it, so the patch only does iopoll for single-page
    swapin.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
    Signed-off-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.
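
    Completion handlers then test the new field, e.g. (sketch):

        static void end_swap_bio_write(struct bio *bio)
        {
                struct page *page = bio->bi_io_vec[0].bv_page;

                if (bio->bi_status) {           /* blk_status_t, BLK_STS_OK == 0 */
                        SetPageError(page);
                        set_page_dirty(page);   /* keep the data so writeback can be retried */
                }
                end_page_writeback(page);
                bio_put(bio);
        }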

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

03 Nov, 2016

1 commit

  • Add wbc_to_write_flags(), which returns the write modifier flags to use,
    based on a struct writeback_control. No functional changes in this
    patch, but it prepares us for factoring other wbc fields for write type.
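
    The helper is tiny; roughly (a sketch of the initial version):

        static inline int wbc_to_write_flags(struct writeback_control *wbc)
        {
                if (wbc->sync_mode == WB_SYNC_ALL)
                        return REQ_SYNC;        /* integrity writeback: issue as a sync write */
                return 0;
        }

        /* callers: bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc); */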

    Signed-off-by: Jens Axboe
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     

08 Oct, 2016

1 commit


20 Sep, 2016

1 commit

  • Commit 62c230bc1790 ("mm: add support for a filesystem to activate
    swap files and use direct_IO for writing swap pages") replaced the
    swap_aops dirty hook from __set_page_dirty_no_writeback() with
    swap_set_page_dirty().

    For normal cases without these special SWP flags, the code path falls
    back to __set_page_dirty_no_writeback(), so the behaviour is expected to
    be the same as before.

    But swap_set_page_dirty() makes use of the page_swap_info() helper to
    get the swap_info_struct to check for the flags like SWP_FILE,
    SWP_BLKDEV etc as desired for those features. This helper has
    BUG_ON(!PageSwapCache(page)) which is racy and safe only for the
    set_page_dirty_lock() path.

    For the set_page_dirty() path, which often needs to be callable from irq
    context, kswapd() can toggle the flag behind our back while the call is
    executing, when the system is low on memory and heavy swapping is
    ongoing.

    This ends up in an undesired kernel panic.

    This patch just moves the check out of the helper and into its users as
    appropriate, to fix the kernel panic for the described path. A couple of
    users of the helper already take care of the SwapCache condition, so I
    skipped them.
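
    The shape of the change, as a sketch: the helper stops asserting, and the
    page-lock-holding callers keep an explicit check.

        struct swap_info_struct *page_swap_info(struct page *page)
        {
                swp_entry_t entry = { .val = page_private(page) };

                /* the racy BUG_ON(!PageSwapCache(page)) is gone from here */
                return swap_info[swp_type(entry)];
        }

        int swap_set_page_dirty(struct page *page)
        {
                struct swap_info_struct *sis = page_swap_info(page);

                if (sis->flags & SWP_FILE) {
                        /* only the locked swapcache path gets here */
                        VM_BUG_ON_PAGE(!PageSwapCache(page), page);
                        return sis->swap_file->f_mapping->a_ops->set_page_dirty(page);
                }
                return __set_page_dirty_no_writeback(page);
        }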

    Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
    Signed-off-by: Santosh Shilimkar
    Cc: Mel Gorman
    Cc: Joe Perches
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: David S. Miller
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Al Viro
    Cc: [4.7.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     

08 Aug, 2016

1 commit


29 Jul, 2016

1 commit

  • generic_swapfile_activate() can take quite a long time, as it iterates
    over all blocks of a file, so add cond_resched() to it. I observed stalls
    of about 1 second when activating a swapfile that was almost
    unfragmented; this patch fixes it.
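
    The change is just a resched point inside the long block-mapping loop; a
    sketch of the loop shape (variables are the function's own context):

        while (probe_block + blocks_per_page <= last_block) {
                first_block = bmap(inode, probe_block);   /* one (possibly slow) lookup per block */
                /* ... extend or start a swap extent for this range ... */
                probe_block += blocks_per_page;
                cond_resched();   /* avoid multi-second stalls on huge swapfiles */
        }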

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221710580.4818@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Alexander Viro
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

08 Jun, 2016

2 commits

  • This patch converts the simple bi_rw use cases in the block, drivers, mm
    and fs code to set/get the bio operation using bio_set_op_attrs/bio_op.

    These should be simple one- or two-liner cases, so I just did them
    in one patch. The next patches handle the more complicated
    cases in a module per patch.
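
    A representative conversion (sketch):

        /* setting the operation (was: bio->bi_rw = READ) */
        bio_set_op_attrs(bio, REQ_OP_READ, 0);

        /* reading it back (was: checking bio->bi_rw & WRITE) */
        if (bio_op(bio) == REQ_OP_WRITE)
                count_vm_event(PSWPOUT);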

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
    instead of passing it in. This makes that use the same as
    generic_make_request and how we set the other bio fields.

    Signed-off-by: Mike Christie

    Fixed up fs/ext4/crypto.c

    Signed-off-by: Jens Axboe

    Mike Christie
     

18 May, 2016

1 commit

  • Pull vfs cleanups from Al Viro:
    "More cleanups from Christoph"

    * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nfsd: use RWF_SYNC
    fs: add RWF_DSYNC aand RWF_SYNC
    ceph: use generic_write_sync
    fs: simplify the generic_write_sync prototype
    fs: add IOCB_SYNC and IOCB_DSYNC
    direct-io: remove the offset argument to dio_complete
    direct-io: eliminate the offset argument to ->direct_IO
    xfs: eliminate the pos variable in xfs_file_dio_aio_write
    filemap: remove the pos argument to generic_file_direct_write
    filemap: remove pos variables in generic_file_read_iter

    Linus Torvalds
     

02 May, 2016

1 commit


29 Apr, 2016

1 commit

  • Kyeongdon reported the error below, which is the BUG_ON(!PageSwapCache(page))
    in page_swap_info(). The reason is that page_endio() in the rw_page path
    unlocks the page when read I/O completes, so we need to take the page
    lock again to check PageSwapCache. Otherwise, the page can be removed
    from the swapcache.
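
    The shape of the fix in the rw_page read path, as a sketch (assuming
    swap_slot_free_notify() is the helper that ends up calling
    page_swap_info()):

        ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
        if (!ret) {
                /*
                 * page_endio() already unlocked the page; re-take the lock so
                 * PageSwapCache() cannot change under us before we look up the
                 * swap_info and notify the device.
                 */
                if (trylock_page(page)) {
                        swap_slot_free_notify(page);
                        unlock_page(page);
                }
                count_vm_event(PSWPIN);
                return 0;
        }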

    Kernel BUG at c00f9040 [verbose debug info unavailable]
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
    Modules linked in:
    CPU: 4 PID: 13446 Comm: RenderThread Tainted: G W 3.10.84-g9f14aec-dirty #73
    task: c3b73200 ti: dd192000 task.ti: dd192000
    PC is at page_swap_info+0x10/0x2c
    LR is at swap_slot_free_notify+0x18/0x6c
    pc : [] lr : [] psr: 400f0113
    sp : dd193d78 ip : c2deb1e4 fp : da015180
    r10: 00000000 r9 : 000200da r8 : c120fe08
    r7 : 00000000 r6 : 00000000 r5 : c249a6c0 r4 : c249a6c0
    r3 : 00000000 r2 : 40080009 r1 : 200f0113 r0 : c249a6c0
    .. ..
    Call Trace:
    page_swap_info+0x10/0x2c
    swap_slot_free_notify+0x18/0x6c
    swap_readpage+0x90/0x11c
    read_swap_cache_async+0x134/0x1ac
    swapin_readahead+0x70/0xb0
    handle_pte_fault+0x320/0x6fc
    handle_mm_fault+0xc0/0xf0
    do_page_fault+0x11c/0x36c
    do_DataAbort+0x34/0x118

    Fixes: 3f2b1a04f44933f2 ("zram: revive swap_slot_free_notify")
    Signed-off-by: Minchan Kim
    Tested-by: Kyeongdon Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim