24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through
    markings that are no longer necessary (a minimal sketch of the conversion
    follows this entry).

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
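
    A minimal userspace sketch of the conversion described above, assuming a
    GCC/Clang version that understands the fallthrough statement attribute;
    the macro definition here is a simplified stand-in for the kernel's,
    shown only to make the before/after shape concrete:

    #include <stdio.h>

    /* Simplified stand-in for the kernel's pseudo-keyword: on compilers
     * that support the fallthrough statement attribute it silences the
     * implicit-fallthrough warning, exactly like the old comment did. */
    #if defined(__has_attribute)
    #if __has_attribute(__fallthrough__)
    #define fallthrough __attribute__((__fallthrough__))
    #endif
    #endif
    #ifndef fallthrough
    #define fallthrough do {} while (0)  /* fallback: no-op */
    #endif

    static const char *classify(int op)
    {
            switch (op) {
            case 0:
                    /* was a "fall through" comment here */
                    fallthrough;
            case 1:
                    return "accounted";
            default:
                    return "ignored";
            }
    }

    int main(void)
    {
            puts(classify(0));
            return 0;
    }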
     

30 May, 2020

2 commits


17 Apr, 2020

1 commit


06 Oct, 2019

1 commit

  • scale_up wakes up waiters after scaling up. But once the maximum has been
    reached, it should not wake up more waiters, as they will not have
    anything to do. This patch fixes this by making scale_up (and also
    scale_down) return when the threshold is reached (see the sketch after
    this entry).

    This bug causes increased fdatasync latency when fdatasync and dd
    conv=sync are performed in parallel on 4.19 compared to 4.14. This
    bug was introduced during refactoring of blk-wbt code.

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Cc: stable@vger.kernel.org
    Cc: Josef Bacik
    Signed-off-by: Harshad Shirwadkar
    Signed-off-by: Jens Axboe

    Harshad Shirwadkar
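
    A simplified userspace model of the behaviour described above
    (illustrative only; the struct and field names are stand-ins, not the
    blk-wbt code): the scaling helper reports whether it actually added
    room, and the caller only wakes waiters when it did.

    #include <stdbool.h>
    #include <stdio.h>

    struct depth_model {
            unsigned int depth;      /* current number of allowed requests */
            unsigned int max_depth;  /* upper bound; scaling stops here */
    };

    /* Returns true only if the depth actually grew; once max_depth is
     * reached there is nothing new for waiters to do, so report false. */
    static bool scale_up(struct depth_model *d)
    {
            if (d->depth >= d->max_depth)
                    return false;
            d->depth++;
            return true;
    }

    int main(void)
    {
            struct depth_model d = { .depth = 63, .max_depth = 64 };

            /* Only wake waiters when scale_up reports that room was added. */
            if (scale_up(&d))
                    puts("depth grew: wake waiters");
            if (!scale_up(&d))
                    puts("already at max: no wakeup");
            return 0;
    }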
     

29 Aug, 2019

1 commit


28 Aug, 2019

1 commit


01 May, 2019

1 commit


25 Jan, 2019

1 commit

  • This patch prevents sparse and gcc from reporting the following warnings
    (a generic illustration of silencing them follows this entry):

    CHECK block/blk-wbt.c
    block/blk-wbt.c:600:6: warning: symbol 'wbt_issue' was not declared. Should it be static?
    block/blk-wbt.c:620:6: warning: symbol 'wbt_requeue' was not declared. Should it be static?
    CC block/blk-wbt.o
    block/blk-wbt.c:600:6: warning: no previous prototype for wbt_issue [-Wmissing-prototypes]
    void wbt_issue(struct rq_qos *rqos, struct request *rq)
    ^~~~~~~~~
    block/blk-wbt.c:620:6: warning: no previous prototype for wbt_requeue [-Wmissing-prototypes]
    void wbt_requeue(struct rq_qos *rqos, struct request *rq)
    ^~~~~~~~~~~

    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
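
    The warnings above are the usual sign of a function that is only used
    inside one translation unit but still has external linkage; a generic,
    hypothetical illustration of how such warnings are silenced (not the
    blk-wbt code):

    /* example.c -- build with: gcc -Wmissing-prototypes -c example.c */

    /* Giving a file-local helper static linkage satisfies both the sparse
     * "should it be static?" check and gcc's -Wmissing-prototypes, since
     * no header needs to declare it any more. */
    static int local_helper(int value)
    {
            return value * 2;
    }

    int exported_entry(int value);   /* prototype for external callers */

    int exported_entry(int value)
    {
            return local_helper(value);
    }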
     

17 Dec, 2018

1 commit


12 Dec, 2018

1 commit

  • rwb_enabled() can't be changed while there is any inflight IO.

    wbt_disable_default() may set rwb->wb_normal to zero, but the blk_stat
    timer may still be pending, and the timer function will then update
    rwb->wb_normal again.

    This patch introduces blk_stat_deactivate() and applies it in
    wbt_disable_default(), which fixes the following IO hang triggered when
    running parted and switching the IO scheduler (a single-threaded sketch
    of the ordering problem follows this entry):

    [ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
    [ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
    [ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 369.940768] parted D 0 3645 3239 0x00000000
    [ 369.941500] Call Trace:
    [ 369.941874] ? __schedule+0x6d9/0x74c
    [ 369.942392] ? wbt_done+0x5e/0x5e
    [ 369.942864] ? wbt_cleanup_cb+0x16/0x16
    [ 369.943404] ? wbt_done+0x5e/0x5e
    [ 369.943874] schedule+0x67/0x78
    [ 369.944298] io_schedule+0x12/0x33
    [ 369.944771] rq_qos_wait+0xb5/0x119
    [ 369.945193] ? karma_partition+0x1c2/0x1c2
    [ 369.945691] ? wbt_cleanup_cb+0x16/0x16
    [ 369.946151] wbt_wait+0x85/0xb6
    [ 369.946540] __rq_qos_throttle+0x23/0x2f
    [ 369.947014] blk_mq_make_request+0xe6/0x40a
    [ 369.947518] generic_make_request+0x192/0x2fe
    [ 369.948042] ? submit_bio+0x103/0x11f
    [ 369.948486] ? __radix_tree_lookup+0x35/0xb5
    [ 369.949011] submit_bio+0x103/0x11f
    [ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
    [ 369.949962] submit_bio_wait+0x53/0x7f
    [ 369.950469] blkdev_issue_flush+0x8a/0xae
    [ 369.951032] blkdev_fsync+0x2f/0x3a
    [ 369.951502] do_fsync+0x2e/0x47
    [ 369.951887] __x64_sys_fsync+0x10/0x13
    [ 369.952374] do_syscall_64+0x89/0x149
    [ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 369.953492] RIP: 0033:0x7f95a1e729d4
    [ 369.953996] Code: Bad RIP value.
    [ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    [ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
    [ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
    [ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
    [ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
    [ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008

    Cc: stable@vger.kernel.org
    Cc: Paolo Valente
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
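
    A single-threaded sketch of the ordering described above (illustrative
    names only, not the kernel API): if the stat callback can still run
    after wbt has been disabled, it re-derives wb_normal and undoes the
    disable, so the stats machinery has to be deactivated first.

    #include <stdbool.h>
    #include <stdio.h>

    struct wbt_model {
            bool stats_active;       /* stands in for the pending blk_stat timer */
            unsigned int wb_normal;  /* 0 means "throttling disabled" */
    };

    /* Stand-in for the blk_stat timer callback: recomputes the limits. */
    static void stat_timer_fn(struct wbt_model *w)
    {
            if (w->stats_active)
                    w->wb_normal = 8;   /* would overwrite a concurrent disable */
    }

    /* Buggy order: clear wb_normal while the callback may still fire. */
    static void disable_buggy(struct wbt_model *w)
    {
            w->wb_normal = 0;
            stat_timer_fn(w);           /* pending timer runs afterwards */
    }

    /* Fixed order: deactivate the stats machinery first, then clear. */
    static void disable_fixed(struct wbt_model *w)
    {
            w->stats_active = false;    /* analogue of blk_stat_deactivate() */
            stat_timer_fn(w);           /* a late callback is now a no-op */
            w->wb_normal = 0;
    }

    int main(void)
    {
            struct wbt_model a = { true, 8 }, b = { true, 8 };

            disable_buggy(&a);
            disable_fixed(&b);
            printf("buggy: wb_normal=%u, fixed: wb_normal=%u\n",
                   a.wb_normal, b.wb_normal);
            return 0;
    }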
     

08 Dec, 2018

1 commit


16 Nov, 2018

4 commits


08 Nov, 2018

1 commit


12 Oct, 2018

1 commit

  • Tetsuo brought to my attention that I screwed up the scale_up/scale_down
    helpers when I factored out the rq-qos code. We need to wake up all the
    waiters when we add slots for requests, not when we shrink the slots.
    Otherwise we'll end up with things waiting forever. This was a mistake;
    this patch simply puts everything back the way it was.

    Cc: stable@vger.kernel.org
    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Reported-by: Tetsuo Handa
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

28 Aug, 2018

3 commits

  • We already note and mark discard and swap IO from bio_to_wbt_flags().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We have two potential issues:

    1) After commit 2887e41b910b, we only wake one process at a time when
    we finish an IO. We really want to wake up as many tasks as can
    queue IO. Before this commit, we woke up everyone, which could cause
    a thundering herd issue.

    2) A task can potentially consume two wakeups, causing us to (in
    practice) miss a wakeup.

    Fix both by providing our own wakeup function, which stops
    __wake_up_common() from waking up more tasks if we fail to get a
    queueing token. With the strict ordering we have on the wait list, this
    wakes the right tasks and the right number of tasks (a sketch of this
    wakeup policy follows this entry).

    Based on a patch from Jianchao Wang.

    Tested-by: Agarwal, Anchal
    Signed-off-by: Jens Axboe

    Jens Axboe
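
    A userspace sketch of that wakeup policy (no real wait queues here;
    tokens stands in for the queueing budget and the names are made up):
    waiters are woken in FIFO order only while tokens can still be taken,
    so no wakeup is wasted on a task that could not queue IO anyway.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_WAITERS 5

    /* Try to take one queueing token; loosely analogous to grabbing a
     * slot under the inflight limit. */
    static bool get_token(int *tokens)
    {
            if (*tokens <= 0)
                    return false;
            (*tokens)--;
            return true;
    }

    /* Wake waiters in FIFO order, but stop at the first one that cannot
     * get a token -- waking it, or anyone behind it, would be pointless. */
    static int wake_waiters(int *tokens, int nr_waiting)
    {
            int woken = 0;

            while (woken < nr_waiting && get_token(tokens))
                    woken++;
            return woken;
    }

    int main(void)
    {
            int tokens = 2;
            int woken = wake_waiters(&tokens, NR_WAITERS);

            printf("woke %d of %d waiters, %d tokens left\n",
                   woken, NR_WAITERS, tokens);
            return 0;
    }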
     
  • Prep patch for calling the handler from a different context,
    no functional changes in this patch.

    Tested-by: Agarwal, Anchal
    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Aug, 2018

4 commits

  • A previous commit removed the ability to have per-rq flags. We used
    those flags to maintain inflight counts. Since we don't have those
    anymore, we have to always maintain inflight counts, even if wbt is
    disabled. This is clearly suboptimal.

    Add a queue quiesce around changing the wbt latency settings from sysfs
    to work around this. With that, we can reliably put the enabled check in
    our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
    consistent for the lifetime of the request.

    Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We need to do this inside the loop as well, or we can allow new
    IO to supersede previous IO.

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We need the memory barrier before checking the list head,
    use the appropriate helper for this. The matching queue
    side memory barrier is provided by set_current_state().

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Check it in one place, instead of in multiple places.

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Aug, 2018

1 commit

  • One wbt invariant is that if an IO is tracked via WBT_TRACKED,
    rqw->inflight must be updated to account for it.

    But commit c1c80384c8f ("block: remove external dependency on wbt_flags")
    forgot to remove the early handling of !rwb_enabled(rwb) inside wbt_wait().
    As a result, the inflight counter may not be increased in wbt_wait() but is
    still decreased in wbt_done() for this kind of IO, so the counter can
    become negative and wbt_wait() may then wait forever (a compact model of
    this imbalance follows this entry).

    This patch fixes the report in the following link:

    https://marc.info/?l=linux-block&m=153221542021033&w=2

    Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
    Cc: Josef Bacik
    Reported-by: Ming Lei
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
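
    A compact userspace model of the imbalance described above (illustrative
    names only): the counter is only correct if every completion-side
    decrement is paired with a submission-side increment, so an early return
    that skips the increment for IO that is still marked as tracked drives
    the counter negative.

    #include <stdbool.h>
    #include <stdio.h>

    static int inflight;    /* stands in for rqw->inflight */

    /* Submission side: the buggy early return skips the increment even
     * though the IO will still be treated as tracked on completion. */
    static void wait_model(bool wbt_enabled, bool buggy_early_return)
    {
            if (!wbt_enabled && buggy_early_return)
                    return;
            inflight++;
    }

    /* Completion side: always decrements for tracked IO. */
    static void done_model(void)
    {
            inflight--;
    }

    int main(void)
    {
            wait_model(false, true);    /* wbt disabled + early return */
            done_model();
            printf("inflight after one IO: %d (negative => waiters hang)\n",
                   inflight);
            return 0;
    }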
     

08 Aug, 2018

1 commit

  • I am currently running a large bare metal instance (i3.metal)
    on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
    4.18 kernel. I have a workload that simulates a database
    workload, and I am running into lockup issues when writeback
    throttling is enabled, with the hung task detector also
    kicking in.

    Crash dumps show that most CPUs (up to 50 of them) are
    all trying to get the wbt wait queue lock while trying to add
    themselves to it in __wbt_wait (see stack traces below).

    [ 0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
    [ 0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
    [ 0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
    [ 0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
    [ 0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
    [ 0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
    [ 0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
    [ 0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
    [ 0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
    [ 0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
    [ 0.948132] FS: 0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
    [ 0.948134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
    [ 0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 0.948138] Call Trace:
    [ 0.948139]
    [ 0.948142] do_raw_spin_lock+0xad/0xc0
    [ 0.948145] _raw_spin_lock_irqsave+0x44/0x4b
    [ 0.948149] ? __wake_up_common_lock+0x53/0x90
    [ 0.948150] __wake_up_common_lock+0x53/0x90
    [ 0.948155] wbt_done+0x7b/0xa0
    [ 0.948158] blk_mq_free_request+0xb7/0x110
    [ 0.948161] __blk_mq_complete_request+0xcb/0x140
    [ 0.948166] nvme_process_cq+0xce/0x1a0 [nvme]
    [ 0.948169] nvme_irq+0x23/0x50 [nvme]
    [ 0.948173] __handle_irq_event_percpu+0x46/0x300
    [ 0.948176] handle_irq_event_percpu+0x20/0x50
    [ 0.948179] handle_irq_event+0x34/0x60
    [ 0.948181] handle_edge_irq+0x77/0x190
    [ 0.948185] handle_irq+0xaf/0x120
    [ 0.948188] do_IRQ+0x53/0x110
    [ 0.948191] common_interrupt+0x87/0x87
    [ 0.948192]
    ....
    [ 0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
    [ 0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
    [ 0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
    [ 0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
    [ 0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
    [ 0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
    [ 0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
    [ 0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
    [ 0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
    [ 0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
    [ 0.311149] FS: 000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
    [ 0.311150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
    [ 0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 0.311154] Call Trace:
    [ 0.311157] do_raw_spin_lock+0xad/0xc0
    [ 0.311160] _raw_spin_lock_irqsave+0x44/0x4b
    [ 0.311162] ? prepare_to_wait_exclusive+0x28/0xb0
    [ 0.311164] prepare_to_wait_exclusive+0x28/0xb0
    [ 0.311167] wbt_wait+0x127/0x330
    [ 0.311169] ? finish_wait+0x80/0x80
    [ 0.311172] ? generic_make_request+0xda/0x3b0
    [ 0.311174] blk_mq_make_request+0xd6/0x7b0
    [ 0.311176] ? blk_queue_enter+0x24/0x260
    [ 0.311178] ? generic_make_request+0xda/0x3b0
    [ 0.311181] generic_make_request+0x10c/0x3b0
    [ 0.311183] ? submit_bio+0x5c/0x110
    [ 0.311185] submit_bio+0x5c/0x110
    [ 0.311197] ? __ext4_journal_stop+0x36/0xa0 [ext4]
    [ 0.311210] ext4_io_submit+0x48/0x60 [ext4]
    [ 0.311222] ext4_writepages+0x810/0x11f0 [ext4]
    [ 0.311229] ? do_writepages+0x3c/0xd0
    [ 0.311239] ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
    [ 0.311240] do_writepages+0x3c/0xd0
    [ 0.311243] ? _raw_spin_unlock+0x24/0x30
    [ 0.311245] ? wbc_attach_and_unlock_inode+0x165/0x280
    [ 0.311248] ? __filemap_fdatawrite_range+0xa3/0xe0
    [ 0.311250] __filemap_fdatawrite_range+0xa3/0xe0
    [ 0.311253] file_write_and_wait_range+0x34/0x90
    [ 0.311264] ext4_sync_file+0x151/0x500 [ext4]
    [ 0.311267] do_fsync+0x38/0x60
    [ 0.311270] SyS_fsync+0xc/0x10
    [ 0.311272] do_syscall_64+0x6f/0x170
    [ 0.311274] entry_SYSCALL_64_after_hwframe+0x42/0xb7

    In the original patch, wbt_done wakes up all the exclusive
    processes in the wait queue, which can cause a thundering herd
    if there is a large number of writer threads in the queue. The
    original intention of the code seems to be to wake up only one
    thread; however, it uses wake_up_all() in __wbt_done(), and then
    uses the following check in __wbt_wait to have only one thread
    actually get out of the wait loop:

    if (waitqueue_active(&rqw->wait) &&
        rqw->wait.head.next != &wait->entry)
            return false;

    The problem with this is that the wait entry in wbt_wait is
    defined with DEFINE_WAIT, which uses the autoremove wakeup function.
    That means that the above check is invalid - the wait entry will
    have been removed from the queue already by the time we hit the
    check in the loop.

    Secondly, auto-removing the wait entries also means that the wait
    queue essentially gets reordered "randomly" (e.g. threads re-add
    themselves in the order they got to run after being woken up).
    Additionally, new requests entering wbt_wait might overtake requests
    that were queued earlier, because the wait queue will be
    (temporarily) empty after the wake_up_all, so the waitqueue_active
    check will not stop them. This can cause certain threads to starve
    under high load.

    The fix is to leave the woken up requests in the queue and remove
    them in finish_wait() once the current thread breaks out of the
    wait loop in __wbt_wait. This will ensure new requests always
    end up at the back of the queue, and they won't overtake requests
    that are already in the wait queue. With that change, the loop
    in wbt_wait is also in line with many other wait loops in the kernel.
    Waking up just one thread drastically reduces lock contention, as
    does moving the wait queue add/remove out of the loop.

    A significant drop in lockdep's lock contention numbers is seen when
    running the test application on the patched kernel.

    Signed-off-by: Anchal Agarwal
    Signed-off-by: Frank van der Linden
    Signed-off-by: Jens Axboe

    Anchal Agarwal
     

09 Jul, 2018

2 commits


09 May, 2018

6 commits

  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4-byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit (an illustrative bit-packing sketch
    follows this entry).

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
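
    An illustrative sketch of the kind of squashing described above; the
    field widths and helper names are made up for the example and are not
    the kernel's actual blk_issue_stat layout:

    #include <stdint.h>
    #include <stdio.h>

    /* Made-up layout: 4 bits of flags, 12 bits of size class, and the
     * remaining 48 bits for a timestamp, all packed into one u64. */
    #define STAT_FLAGS_BITS  4
    #define STAT_SIZE_BITS   12
    #define STAT_TIME_SHIFT  (STAT_FLAGS_BITS + STAT_SIZE_BITS)

    static uint64_t pack(uint64_t time_ns, uint32_t size_class, uint32_t flags)
    {
            return (time_ns << STAT_TIME_SHIFT) |
                   ((uint64_t)(size_class & 0xfff) << STAT_FLAGS_BITS) |
                   (flags & 0xf);
    }

    static uint64_t unpack_time(uint64_t s)  { return s >> STAT_TIME_SHIFT; }
    static uint32_t unpack_size(uint64_t s)  { return (s >> STAT_FLAGS_BITS) & 0xfff; }
    static uint32_t unpack_flags(uint64_t s) { return s & 0xf; }

    int main(void)
    {
            uint64_t stat = pack(123456789ULL, 4, 0x3);

            printf("time=%llu size_class=%u flags=%#x\n",
                   (unsigned long long)unpack_time(stat),
                   unpack_size(stat), unpack_flags(stat));
            return 0;
    }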
     
  • issue_stat is going to go away, so first make writeback throttling take
    the containing request, update the internal wbt helpers accordingly, and
    change rwb->sync_cookie to be the request pointer instead of the
    issue_stat pointer. No functional change.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • A few helpers are only used from blk-wbt.c, so move them there, and put
    wbt_track() behind the CONFIG_BLK_WBT ifdef. This is in preparation
    for changing how the wbt flags are tracked.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Throttle discards like we would any background write. Discards should
    be background activity, so if they are impacting foreground IO, then
    we will throttle them down.

    Reviewed-by: Darrick J. Wong
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is in preparation for having more write queues, in which
    case we would need to pass in more information than just
    a simple 'is_kswapd' boolean.

    Reviewed-by: Darrick J. Wong
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We currently special case WRITE and FLUSH, but we should really
    just include any command with the write bit set. This ensures
    that we account DISCARD.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Feb, 2018

1 commit

  • Mikulas reported a workload that saw bad performance, and figured
    out that it was due to various other types of requests being
    accounted as reads. Flush requests, for instance. Due to the
    high latency of those, we heavily throttle the writes to keep
    the latencies in balance. But they really should be accounted
    as writes.

    Fix this by checking the exact type of the request. If it's a
    read, account it as a read; if it's a write or a flush, account
    it as a write. Any other request we disregard (see the sketch
    after this entry). Previously everything would have been
    mistakenly accounted as reads.

    Reported-by: Mikulas Patocka
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Jens Axboe

    Jens Axboe
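
    A small userspace sketch of the classification rule described above;
    the enums are illustrative stand-ins, not the kernel's
    request-operation definitions:

    #include <stdio.h>

    enum op { OP_READ, OP_WRITE, OP_FLUSH, OP_DISCARD };
    enum account { ACCT_READ, ACCT_WRITE, ACCT_NONE };

    /* Account flushes with the writes they amplify; ignore anything else
     * rather than mistakenly lumping it in with reads. */
    static enum account wbt_account(enum op op)
    {
            switch (op) {
            case OP_READ:
                    return ACCT_READ;
            case OP_WRITE:
            case OP_FLUSH:
                    return ACCT_WRITE;
            default:
                    return ACCT_NONE;
            }
    }

    int main(void)
    {
            printf("flush accounted as %s\n",
                   wbt_account(OP_FLUSH) == ACCT_WRITE ? "write" : "other");
            return 0;
    }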
     

24 Nov, 2017

3 commits


15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enables us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds