17 Aug, 2022

1 commit

  • [ Upstream commit 14a6e2eb7df5c7897c15b109cba29ab0c4a791b6 ]

    In our test of iocost, we encountered some list add/del corruptions of
    inner_walk list in ioc_timer_fn.

    The reason can be described as follows:

    cpu 0                                  cpu 1
    ioc_qos_write                          ioc_qos_write

    ioc = q_to_ioc(queue);
    if (!ioc) {
            ioc = kzalloc();
                                           ioc = q_to_ioc(queue);
                                           if (!ioc) {
                                                   ioc = kzalloc();
                                                   ...
                                                   rq_qos_add(q, rqos);
                                           }
            ...
            rq_qos_add(q, rqos);
            ...
    }

    When the io.cost.qos file is written by two cpus concurrently, rq_qos may
    be added to one disk twice. In that case, there will be two iocs enabled
    and running on one disk, each owning different iocgs on its active list.
    In ioc_timer_fn, because the iocgs from the two iocs share the same root
    iocg, the root iocg's walk_list may be overwritten by either side, which
    leads to list add/del corruption when building or destroying the
    inner_walk list.

    So far, the blk-rq-qos framework has implicitly worked on the assumption
    of one rq_qos instance per type per queue. This patch makes that
    assumption explicit and also fixes the crash above.
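The one-instance-per-type rule can be sketched in plain C. The names below (rq_qos_add(), the id field) are illustrative stand-ins for the kernel code, and the kernel performs this check under the queue lock, whereas this model is single-threaded:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the fix: rq_qos_add() checks for an
 * existing instance of the same type before linking, so a second racing
 * caller fails instead of attaching a duplicate. */
struct rqos {
    int id;                 /* rq-qos type, e.g. a cost/wbt/latency policy */
    struct rqos *next;
};

struct queue {
    struct rqos *rqos_list; /* singly linked list of attached policies */
};

/* Returns false if a policy with the same id is already attached.
 * In the kernel this check would run under the queue lock. */
static bool rq_qos_add(struct queue *q, struct rqos *rqos)
{
    for (struct rqos *cur = q->rqos_list; cur; cur = cur->next)
        if (cur->id == rqos->id)
            return false;   /* one instance per type per queue */
    rqos->next = q->rqos_list;
    q->rqos_list = rqos;
    return true;
}
```

With this shape, the second of two concurrent writers simply gets a failure back rather than leaving two iocs running on one disk.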

    Signed-off-by: Jinke Han
    Reviewed-by: Muchun Song
    Acked-by: Tejun Heo
    Cc:
    Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jinke Han
     

19 Nov, 2021

1 commit

  • [ Upstream commit 480d42dc001bbfe953825a92073012fcd5a99161 ]

    The timer callback used to evaluate if the latency is exceeded can be
    executed after the corresponding disk has been released, causing the
    following NULL pointer dereference:

    [ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
    [ 119.987617] #PF: supervisor read access in kernel mode
    [ 119.987971] #PF: error_code(0x0000) - not-present page
    [ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
    [ 119.988697] Oops: 0000 [#1] SMP NOPTI
    [ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi #rc5+arighi
    [ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
    [ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
    [ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
    [ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
    [ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
    [ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
    [ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
    [ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
    [ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
    [ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
    [ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
    [ 119.995906] Call Trace:
    [ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
    [ 119.996505] blk_stat_timer_fn+0x138/0x140
    [ 119.996830] call_timer_fn+0x2b/0x100
    [ 119.997136] __run_timers.part.0+0x1d1/0x240
    [ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
    [ 119.997826] ? ktime_get+0x3e/0xa0
    [ 119.998110] ? native_apic_msr_write+0x2c/0x30
    [ 119.998456] ? lapic_next_event+0x20/0x30
    [ 119.998779] ? clockevents_program_event+0x94/0xf0
    [ 119.999150] run_timer_softirq+0x2a/0x50
    [ 119.999465] __do_softirq+0xcb/0x26f
    [ 119.999764] irq_exit_rcu+0x8c/0xb0
    [ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
    [ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
    [ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20

    In this case simply return from the timer callback (no action
    required) to prevent the NULL pointer dereference.
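A minimal user-space model of the fix, with invented structure names standing in for the kernel's: the timer callback returns early when the backing data has already been released, so nothing is dereferenced:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only: the real wb_timer_fn() operates on blk-stat
 * callbacks; here a NULL stats pointer models the released disk. */
struct disk_stats { long total; };
struct rwb { struct disk_stats *stats; long evaluated; };

static void wb_timer_fn(struct rwb *rwb)
{
    if (!rwb->stats)        /* disk already gone: nothing to evaluate */
        return;
    rwb->evaluated += rwb->stats->total;
}
```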

    BugLink: https://bugs.launchpad.net/bugs/1947557
    Link: https://lore.kernel.org/linux-mm/YWRNVTk9N8K0RMst@arighi-desktop/
    Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
    Signed-off-by: Andrea Righi
    Link: https://lore.kernel.org/r/YW6N2qXpBU3oc50q@arighi-desktop
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Andrea Righi
     

24 Aug, 2021

1 commit

  • Replace the magic lookup through the kobject tree with an explicit
    backpointer, given that the device model links are set up and torn
    down at times when I/O is still possible, leading to potential
    NULL or invalid pointer dereferences.
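The design choice can be illustrated with a hedged sketch (bdi_owner() and the field names are invented for this example): a backpointer set at allocation time stays valid for the object's whole lifetime, whereas a lookup through sysfs/kobject links has a window where it returns NULL while I/O is still in flight:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: an explicit backpointer replaces the "magic" tree lookup. */
struct gendisk { int id; };
struct bdi {
    struct gendisk *owner;  /* explicit backpointer (the fix) */
};

static struct gendisk *bdi_owner(struct bdi *bdi)
{
    return bdi->owner;      /* no kobject walk, no NULL window */
}
```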

    Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
    Reported-by: syzbot
    Signed-off-by: Christoph Hellwig
    Tested-by: Sven Schnelle
    Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Aug, 2021

1 commit


22 Jun, 2021

2 commits

    After commit a79050434b45 ("blk-rq-qos: refactor out common elements of
    blk-wbt"), once throttling was disabled by wbt_disable_default() it
    could not be enabled again. Fix this by setting enable_state back to
    WBT_STATE_ON_DEFAULT.

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Signed-off-by: Zhang Yi
    Link: https://lore.kernel.org/r/20210619093700.920393-3-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    Zhang Yi
     
    Currently we disable wbt by simply zeroing rwb->wb_normal in
    wbt_disable_default() when switching the elevator to bfq, but this is
    not safe: a later queue-depth change can turn the check into a false
    positive. If wbt appears enabled again between wbt_wait() and
    wbt_track() while a write request is submitted, rqw->inflight drops to
    -1 in wbt_done(), which ends up triggering an IO hang. Fix this issue
    by introducing a new state that explicitly marks wbt as disabled.
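The state machine can be modeled in a few lines of user-space C. The state names follow the commits on this page; the helper bodies are simplified sketches, not the kernel's exact functions:

```c
#include <assert.h>
#include <stdbool.h>

/* Disabling sets a dedicated state instead of zeroing wb_normal, so a
 * queue-depth change can no longer make wbt look enabled again. */
enum wbt_state {
    WBT_STATE_ON_DEFAULT,
    WBT_STATE_ON_MANUAL,
    WBT_STATE_OFF_DEFAULT,  /* new: wbt disabled by default policy */
};

struct rwb { enum wbt_state enable_state; };

static bool wbt_enabled(const struct rwb *rwb)
{
    return rwb->enable_state != WBT_STATE_OFF_DEFAULT;
}

static void wbt_disable_default(struct rwb *rwb)
{
    if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
        rwb->enable_state = WBT_STATE_OFF_DEFAULT;
}

/* Companion fix from the sibling commit: re-enabling restores the state. */
static void wbt_enable_default(struct rwb *rwb)
{
    if (rwb->enable_state == WBT_STATE_OFF_DEFAULT)
        rwb->enable_state = WBT_STATE_ON_DEFAULT;
}
```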

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Signed-off-by: Zhang Yi
    Link: https://lore.kernel.org/r/20210619093700.920393-2-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    Zhang Yi
     

19 Jun, 2021

1 commit

  • Now wbt_wait() returns void, so remove the now-outdated comment.

    Signed-off-by: lijiazi
    Link: https://lore.kernel.org/r/1623986240-13878-1-git-send-email-lijiazi@xiaomi.com
    Signed-off-by: Jens Axboe

    lijiazi
     

27 Jan, 2021

1 commit


01 Dec, 2020

1 commit


24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and its variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings when it is the case.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

30 May, 2020

2 commits


17 Apr, 2020

1 commit


06 Oct, 2019

1 commit

  • scale_up wakes up waiters after scaling up. But after scaling to the
    max, it should not wake up more waiters, as they will have nothing to
    do. This patch fixes this by making scale_up (and also scale_down)
    return immediately when the threshold is reached.

    This bug causes increased fdatasync latency when fdatasync and dd
    conv=sync are performed in parallel on 4.19 compared to 4.14. This
    bug was introduced during refactoring of blk-wbt code.
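A user-space sketch of the idea; rq_depth_scale_up(), the field names, and the wakeups counter are illustrative, not the kernel's exact code:

```c
#include <assert.h>
#include <stdbool.h>

/* scale_up() reports whether the limit actually grew; the caller only
 * wakes waiters when it did, so hitting the max stops useless wakeups. */
struct rqd { int limit, max; int wakeups; };

static bool scale_up(struct rqd *d)
{
    if (d->limit >= d->max)
        return false;       /* already at max: no new slots, no wake */
    d->limit++;
    return true;
}

static void rq_depth_scale_up(struct rqd *d)
{
    if (scale_up(d))
        d->wakeups++;       /* stand-in for wake_up_all(&rqw->wait) */
}
```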

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Cc: stable@vger.kernel.org
    Cc: Josef Bacik
    Signed-off-by: Harshad Shirwadkar
    Signed-off-by: Jens Axboe

    Harshad Shirwadkar
     

29 Aug, 2019

1 commit


28 Aug, 2019

1 commit


01 May, 2019

1 commit


25 Jan, 2019

1 commit

  • This patch avoids that sparse reports the following warnings:

    CHECK block/blk-wbt.c
    block/blk-wbt.c:600:6: warning: symbol 'wbt_issue' was not declared. Should it be static?
    block/blk-wbt.c:620:6: warning: symbol 'wbt_requeue' was not declared. Should it be static?
    CC block/blk-wbt.o
    block/blk-wbt.c:600:6: warning: no previous prototype for wbt_issue [-Wmissing-prototypes]
    void wbt_issue(struct rq_qos *rqos, struct request *rq)
    ^~~~~~~~~
    block/blk-wbt.c:620:6: warning: no previous prototype for wbt_requeue [-Wmissing-prototypes]
    void wbt_requeue(struct rq_qos *rqos, struct request *rq)
    ^~~~~~~~~~~

    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

17 Dec, 2018

1 commit


12 Dec, 2018

1 commit

  • rwb_enabled() can't be changed when there is any inflight IO.

    wbt_disable_default() may set rwb->wb_normal as zero, however the
    blk_stat timer may still be pending, and the timer function will update
    wrb->wb_normal again.

    This patch introduces blk_stat_deactivate() and applies it in
    wbt_disable_default(), then the following IO hang triggered when running
    parted & switching io scheduler can be fixed:

    [ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
    [ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
    [ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 369.940768] parted D 0 3645 3239 0x00000000
    [ 369.941500] Call Trace:
    [ 369.941874] ? __schedule+0x6d9/0x74c
    [ 369.942392] ? wbt_done+0x5e/0x5e
    [ 369.942864] ? wbt_cleanup_cb+0x16/0x16
    [ 369.943404] ? wbt_done+0x5e/0x5e
    [ 369.943874] schedule+0x67/0x78
    [ 369.944298] io_schedule+0x12/0x33
    [ 369.944771] rq_qos_wait+0xb5/0x119
    [ 369.945193] ? karma_partition+0x1c2/0x1c2
    [ 369.945691] ? wbt_cleanup_cb+0x16/0x16
    [ 369.946151] wbt_wait+0x85/0xb6
    [ 369.946540] __rq_qos_throttle+0x23/0x2f
    [ 369.947014] blk_mq_make_request+0xe6/0x40a
    [ 369.947518] generic_make_request+0x192/0x2fe
    [ 369.948042] ? submit_bio+0x103/0x11f
    [ 369.948486] ? __radix_tree_lookup+0x35/0xb5
    [ 369.949011] submit_bio+0x103/0x11f
    [ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
    [ 369.949962] submit_bio_wait+0x53/0x7f
    [ 369.950469] blkdev_issue_flush+0x8a/0xae
    [ 369.951032] blkdev_fsync+0x2f/0x3a
    [ 369.951502] do_fsync+0x2e/0x47
    [ 369.951887] __x64_sys_fsync+0x10/0x13
    [ 369.952374] do_syscall_64+0x89/0x149
    [ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 369.953492] RIP: 0033:0x7f95a1e729d4
    [ 369.953996] Code: Bad RIP value.
    [ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    [ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
    [ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
    [ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
    [ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
    [ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008

    Cc: stable@vger.kernel.org
    Cc: Paolo Valente
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Dec, 2018

1 commit


16 Nov, 2018

4 commits


08 Nov, 2018

1 commit


12 Oct, 2018

1 commit

  • Tetsuo brought to my attention that I screwed up the scale_up/scale_down
    helpers when I factored out the rq-qos code. We need to wake up all the
    waiters when we add slots for requests, not when we shrink the slots.
    Otherwise we'll end up with things waiting forever. This was a mistake,
    and this patch simply puts everything back the way it was.

    cc: stable@vger.kernel.org
    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Reported-by: Tetsuo Handa
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

28 Aug, 2018

3 commits

  • We already note and mark discard and swap IO from bio_to_wbt_flags().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We have two potential issues:

    1) After commit 2887e41b910b, we only wake one process at a time when
    we finish an IO. We really want to wake up as many tasks as can
    queue IO. Before this commit, we woke up everyone, which could cause
    a thundering herd issue.

    2) A task can potentially consume two wakeups, causing us to (in
    practice) miss a wakeup.

    Fix both by providing our own wakeup function, which stops
    __wake_up_common() from waking up more tasks if we fail to get a
    queueing token. With the strict ordering we have on the wait list, this
    wakes the right tasks and the right amount of tasks.

    Based on a patch from Jianchao Wang .
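The token-gated wakeup can be sketched in user space. wake_all_with_tokens() and the fixed-size array are invented stand-ins for the kernel's ordered wait-list walk inside __wake_up_common():

```c
#include <assert.h>
#include <stdbool.h>

/* Walk waiters in order and stop as soon as a queueing token cannot be
 * granted: every wakeup is matched by exactly one token, so none are
 * wasted and none are missed. */
#define NWAIT 4

struct rqw { int free_tokens; bool woken[NWAIT]; };

static int wake_all_with_tokens(struct rqw *w)
{
    int n = 0;
    for (int i = 0; i < NWAIT; i++) {
        if (w->free_tokens == 0)
            break;          /* no token: stop waking further tasks */
        w->free_tokens--;
        w->woken[i] = true;
        n++;
    }
    return n;
}
```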

    Tested-by: Agarwal, Anchal
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Prep patch for calling the handler from a different context,
    no functional changes in this patch.

    Tested-by: Agarwal, Anchal
    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Aug, 2018

4 commits

  • A previous commit removed the ability to have per-rq flags. We used
    those flags to maintain inflight counts. Since we don't have those
    anymore, we have to always maintain inflight counts, even if wbt is
    disabled. This is clearly suboptimal.

    Add a queue quiesce around changing the wbt latency settings from sysfs
    to work around this. With that, we can reliably put the enabled check in
    our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
    consistent for the lifetime of the request.

    Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We need to do this inside the loop as well, or we can allow new
    IO to supersede previous IO.

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We need the memory barrier before checking the list head,
    use the appropriate helper for this. The matching queue
    side memory barrier is provided by set_current_state().
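Loosely modeled with C11 acquire/release atomics standing in for the kernel's barrier pairing (the function names are invented for the sketch): the waiter publishes its list update with release semantics, and the waker reads the list head with acquire semantics before deciding nobody is queued:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* "queued" models the wait-list head being non-empty. */
static _Atomic bool queued;

static void waiter_prepare(void)
{
    /* the kernel's set_current_state() provides this side's barrier */
    atomic_store_explicit(&queued, true, memory_order_release);
}

static bool waker_should_wake(void)
{
    /* barrier before reading the list head (the helper this commit uses) */
    return atomic_load_explicit(&queued, memory_order_acquire);
}
```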

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Check it in one place, instead of in multiple places.

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Aug, 2018

1 commit

  • One wbt invariant is that if an IO is tracked via WBT_TRACKED,
    rqw->inflight should be updated to account for that IO.

    But commit c1c80384c8f ("block: remove external dependency on wbt_flags")
    forgot to remove the early handling of !rwb_enabled(rwb) inside
    wbt_wait(), so the inflight counter may not be increased in wbt_wait()
    yet is still decreased in wbt_done() for this kind of IO. The counter
    can therefore become negative, and wbt_wait() may wait forever.

    This patch fixes the report in the following link:

    https://marc.info/?l=linux-block&m=153221542021033&w=2
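The invariant can be modeled in a few lines; wbt_track()/wbt_done() here are simplified sketches of the kernel functions, keyed purely on the WBT_TRACKED flag:

```c
#include <assert.h>
#include <stdbool.h>

/* inflight is only decremented for I/O that was actually tracked, so the
 * counter can never go negative and wbt_wait() cannot stall on it. */
#define WBT_TRACKED 1u

struct rqw { int inflight; };

static unsigned int wbt_track(struct rqw *w, bool throttled)
{
    if (!throttled)
        return 0;            /* not tracked: no inflight accounting */
    w->inflight++;
    return WBT_TRACKED;
}

static void wbt_done(struct rqw *w, unsigned int flags)
{
    if (flags & WBT_TRACKED) /* only undo what wbt_track() counted */
        w->inflight--;
}
```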

    Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
    Cc: Josef Bacik
    Reported-by: Ming Lei
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Aug, 2018

1 commit

  • I am currently running a large bare metal instance (i3.metal)
    on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
    4.18 kernel. I have a workload that simulates a database
    workload and I am running into lockup issues when writeback
    throttling is enabled, with the hung task detector also
    kicking in.

    Crash dumps show that most CPUs (up to 50 of them) are
    all trying to get the wbt wait queue lock while trying to add
    themselves to it in __wbt_wait (see stack traces below).

    [ 0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
    [ 0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
    [ 0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
    [ 0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
    [ 0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
    [ 0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
    [ 0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
    [ 0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
    [ 0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
    [ 0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
    [ 0.948132] FS: 0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
    [ 0.948134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
    [ 0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 0.948138] Call Trace:
    [ 0.948139]
    [ 0.948142] do_raw_spin_lock+0xad/0xc0
    [ 0.948145] _raw_spin_lock_irqsave+0x44/0x4b
    [ 0.948149] ? __wake_up_common_lock+0x53/0x90
    [ 0.948150] __wake_up_common_lock+0x53/0x90
    [ 0.948155] wbt_done+0x7b/0xa0
    [ 0.948158] blk_mq_free_request+0xb7/0x110
    [ 0.948161] __blk_mq_complete_request+0xcb/0x140
    [ 0.948166] nvme_process_cq+0xce/0x1a0 [nvme]
    [ 0.948169] nvme_irq+0x23/0x50 [nvme]
    [ 0.948173] __handle_irq_event_percpu+0x46/0x300
    [ 0.948176] handle_irq_event_percpu+0x20/0x50
    [ 0.948179] handle_irq_event+0x34/0x60
    [ 0.948181] handle_edge_irq+0x77/0x190
    [ 0.948185] handle_irq+0xaf/0x120
    [ 0.948188] do_IRQ+0x53/0x110
    [ 0.948191] common_interrupt+0x87/0x87
    [ 0.948192]
    ....
    [ 0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
    [ 0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
    [ 0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
    [ 0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
    [ 0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
    [ 0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
    [ 0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
    [ 0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
    [ 0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
    [ 0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
    [ 0.311149] FS: 000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
    [ 0.311150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
    [ 0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 0.311154] Call Trace:
    [ 0.311157] do_raw_spin_lock+0xad/0xc0
    [ 0.311160] _raw_spin_lock_irqsave+0x44/0x4b
    [ 0.311162] ? prepare_to_wait_exclusive+0x28/0xb0
    [ 0.311164] prepare_to_wait_exclusive+0x28/0xb0
    [ 0.311167] wbt_wait+0x127/0x330
    [ 0.311169] ? finish_wait+0x80/0x80
    [ 0.311172] ? generic_make_request+0xda/0x3b0
    [ 0.311174] blk_mq_make_request+0xd6/0x7b0
    [ 0.311176] ? blk_queue_enter+0x24/0x260
    [ 0.311178] ? generic_make_request+0xda/0x3b0
    [ 0.311181] generic_make_request+0x10c/0x3b0
    [ 0.311183] ? submit_bio+0x5c/0x110
    [ 0.311185] submit_bio+0x5c/0x110
    [ 0.311197] ? __ext4_journal_stop+0x36/0xa0 [ext4]
    [ 0.311210] ext4_io_submit+0x48/0x60 [ext4]
    [ 0.311222] ext4_writepages+0x810/0x11f0 [ext4]
    [ 0.311229] ? do_writepages+0x3c/0xd0
    [ 0.311239] ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
    [ 0.311240] do_writepages+0x3c/0xd0
    [ 0.311243] ? _raw_spin_unlock+0x24/0x30
    [ 0.311245] ? wbc_attach_and_unlock_inode+0x165/0x280
    [ 0.311248] ? __filemap_fdatawrite_range+0xa3/0xe0
    [ 0.311250] __filemap_fdatawrite_range+0xa3/0xe0
    [ 0.311253] file_write_and_wait_range+0x34/0x90
    [ 0.311264] ext4_sync_file+0x151/0x500 [ext4]
    [ 0.311267] do_fsync+0x38/0x60
    [ 0.311270] SyS_fsync+0xc/0x10
    [ 0.311272] do_syscall_64+0x6f/0x170
    [ 0.311274] entry_SYSCALL_64_after_hwframe+0x42/0xb7

    In the original patch, wbt_done is waking up all the exclusive
    processes in the wait queue, which can cause a thundering herd
    if there is a large number of writer threads in the queue. The
    original intention of the code seems to have been to wake up only one
    thread; however, it uses wake_up_all() in __wbt_done(), and then
    uses the following check in __wbt_wait() to have only one thread
    actually get out of the wait loop:

    if (waitqueue_active(&rqw->wait) &&
        rqw->wait.head.next != &wait->entry)
            return false;

    The problem with this is that the wait entry in wbt_wait is
    defined with DEFINE_WAIT, which uses the autoremove wakeup function.
    That means that the above check is invalid - the wait entry will
    have been removed from the queue already by the time we hit the
    check in the loop.

    Secondly, auto-removing the wait entries also means that the wait
    queue essentially gets reordered "randomly" (e.g. threads re-add
    themselves in the order they got to run after being woken up).
    Additionally, new requests entering wbt_wait might overtake requests
    that were queued earlier, because the wait queue will be
    (temporarily) empty after the wake_up_all, so the waitqueue_active
    check will not stop them. This can cause certain threads to starve
    under high load.

    The fix is to leave the woken up requests in the queue and remove
    them in finish_wait() once the current thread breaks out of the
    wait loop in __wbt_wait. This will ensure new requests always
    end up at the back of the queue, and they won't overtake requests
    that are already in the wait queue. With that change, the loop
    in wbt_wait is also in line with many other wait loops in the kernel.
    Waking up just one thread drastically reduces lock contention, as
    does moving the wait queue add/remove out of the loop.

    A significant drop in lockdep's lock contention numbers is seen when
    running the test application on the patched kernel.
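The ordering property can be sketched with a tiny FIFO model (wq_add/wq_finish and may_inline_issue are invented names): woken waiters stay on the queue until they remove themselves, so a newcomer that sees a non-empty queue always lines up behind them instead of overtaking:

```c
#include <assert.h>

#define QMAX 8

struct waitq { int entries[QMAX]; int head, tail; };

static int wq_empty(const struct waitq *q) { return q->head == q->tail; }

/* Newcomers may issue directly only when nobody is queued ahead. */
static int may_inline_issue(const struct waitq *q) { return wq_empty(q); }

static void wq_add(struct waitq *q, int id) { q->entries[q->tail++] = id; }

/* Models finish_wait(): the waiter removes itself only after it has
 * broken out of the wait loop, preserving strict FIFO order. */
static int wq_finish(struct waitq *q) { return q->entries[q->head++]; }
```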

    Signed-off-by: Anchal Agarwal
    Signed-off-by: Frank van der Linden
    Signed-off-by: Jens Axboe

    Anchal Agarwal
     

09 Jul, 2018

2 commits


09 May, 2018

2 commits

  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4 byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit.
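Packing several fields into one u64 works roughly like this sketch. The 48-bit split and the helper names are invented for illustration; the kernel's actual layout differs:

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit word carrying a timestamp in the low bits and a small
 * size/flags field above it, in the spirit of blk_issue_stat. */
#define TIME_BITS 48
#define TIME_MASK ((1ULL << TIME_BITS) - 1)

static uint64_t pack(uint64_t time_ns, uint16_t extra)
{
    return (time_ns & TIME_MASK) | ((uint64_t)extra << TIME_BITS);
}

static uint64_t unpack_time(uint64_t v)  { return v & TIME_MASK; }
static uint16_t unpack_extra(uint64_t v) { return (uint16_t)(v >> TIME_BITS); }
```

The commit goes the other way, un-squashing these fields into a spare hole in struct request, but the packed form shows what blk_issue_stat had been doing.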

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • issue_stat is going to go away, so first make writeback throttling take
    the containing request, update the internal wbt helpers accordingly, and
    change rwb->sync_cookie to be the request pointer instead of the
    issue_stat pointer. No functional change.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval