02 Jun, 2015

17 commits

  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->wb_lock and ->worklist into wb.

    * The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
    moving, rename it to wb->work_lock as wb->wb_lock is confusing.
    Also, move wb->dwork downwards so that it's colocated with the new
    ->work_lock and ->work_list fields.

    * bdi_writeback_workfn() -> wb_workfn()
    bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
    bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
    bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
    __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
    get_next_work_item(bdi) -> get_next_work_item(wb)

    * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
    The function contained parts which belong to the containing bdi
    rather than the wb itself - testing cap_writeback_dirty and
    bdi_remove_from_list() invocation. Those are moved to
    bdi_unregister().

    * bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
    Initializations of the moved bdi->wb_lock and ->work_list are
    relocated from bdi_init() to wb_init().

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->state are mechanically replaced with bdi->wb.state
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Writeback operations will now be per wb (bdi_writeback) instead of
    bdi. Replace the relevant bdi references in symbol names and comments
    with wb. This patch is purely cosmetic and doesn't make any
    functional changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Wu Fengguang
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bandwidth related fields from backing_dev_info into
    bdi_writeback.

    * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
    write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
    balanced_dirty_ratelimit, completions and dirty_exceeded.

    * writeback_chunk_size() and over_bground_thresh() now take @wb
    instead of @bdi.

    * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
    bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
    bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...)
    bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
    [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
    bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
    bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

    * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
    respectively. Note that explicit zeroing is dropped in the process
    as wb's are cleared in entirety anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    v2: Typo in description fixed as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->bdi_stat[] into wb.

    * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
    enums is changed from BDI_ to WB_.

    * BDI_STAT_BATCH() -> WB_STAT_BATCH()

    * [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)

    * bdi_stat[_error]() -> wb_stat[_error]()

    * bdi_writeout_inc() -> wb_writeout_inc()

    * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
    frees stat.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Miklos Szeredi
    Cc: Trond Myklebust
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->state into wb.

    * enum bdi_state is renamed to wb_state and the prefix of all enums is
    changed from BDI_ to WB_.

    * Explicit zeroing of bdi->state is removed without adding zeoring of
    wb->state as the whole data structure is zeroed on init anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->state are mechanically replaced with bdi->wb.state
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: drbd-dev@lists.linbit.com
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Implement mem_cgroup_css_from_page() which returns the
    cgroup_subsys_state of the memcg associated with a given page on the
    default hierarchy. This will be used by cgroup writeback support.

    This function assumes that page->mem_cgroup association doesn't change
    until the page is released, which is true on the default hierarchy as
    long as replace_page_cache_page() is not used. As the only user of
    replace_page_cache_page() is FUSE which won't support cgroup writeback
    for the time being, this works for now, and replace_page_cache_page()
    will soon be updated so that the invariant actually holds.

    Note that the RCU protected page->mem_cgroup access is consistent with
    other usages across memcg but ultimately incorrect. These unlocked
    accesses are missing required barriers. page->mem_cgroup should be
    made an RCU pointer and updated and accessed using RCU operations.

    v4: Instead of triggering WARN, return the root css on the traditional
    hierarchies. This makes the function a lot easier to deal with
    especially as there's no light way to synchronize against
    hierarchy rebinding.

    v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/

    v2: Trigger WARN if the function is used on the traditional
    hierarchies and add comment about the assumed invariant.

    Signed-off-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bio can only be associated with the io_context and blkcg
    of %current using bio_associate_current(). This is too restrictive
    for cgroup writeback support. Implement bio_associate_blkcg() which
    associates a bio with the specified blkcg.

    bio_associate_blkcg() leaves the io_context unassociated.
    bio_associate_current() is updated so that it considers a bio as
    already associated if it has a blkcg_css, instead of an io_context,
    associated with it.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Implement a wrapper around task_get_css() to acquire the blkcg css for
    a given task. The wrapper is necessary for cgroup writeback support
    as there will be places outside blkcg proper trying to acquire
    blkcg_css and blkio_cgrp_id will be undefined when !CONFIG_BLK_CGROUP.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio_associate_current() currently open codes task_css() and
    css_tryget_online() to find and pin $current's blkcg css. Abstract it
    into task_get_css() which is implemented from cgroup side. As a task
    is always associated with an online css for every subsystem except
    while the css_set update is propagating, task_get_css() retries till
    css_tryget_online() succeeds.

    This is a cleanup and shouldn't lead to noticeable behavior changes.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add global constant blkcg_root_css which points to &blkcg_root.css.
    This will be used by cgroup writeback support. If blkcg is disabled,
    it's defined as ERR_PTR(-EINVAL).

    v2: The declarations moved to include/linux/blk-cgroup.h as suggested
    by Vivek.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add global mem_cgroup_root_css which points to the root memcg css.
    This will be used by cgroup writeback support. If memcg is disabled,
    it's defined as ERR_PTR(-EINVAL).

    Signed-off-by: Tejun Heo
    Cc: Johannes Weiner
    aCc: Michal Hocko
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkcg does a minor optimization where the root blkcg is
    created when the first blkcg policy is activated on a queue and
    destroyed on the deactivation of the last. On systems where blkcg is
    configured but not used, this saves one blkcg_gq struct per queue. On
    systems where blkcg is actually used, there's no difference. The only
    case where this can lead to any meaninful, albeit still minute, save
    in memory consumption is when all blkcg policies are deactivated after
    being widely used in the system, which is a hihgly unlikely scenario.

    The conditional existence of root blkcg_gq has already created several
    bugs in blkcg and became an issue once again for the new per-cgroup
    wb_congested mechanism for cgroup writeback support leading to a NULL
    dereference when no blkcg policy is active. This is really not worth
    bothering with. This patch makes blkcg always allocate and link the
    root blkcg_gq and release it only on queue destruction.

    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The header file will be used more widely with the pending cgroup
    writeback support and the current set of dummy declarations aren't
    enough to handle different config combinations. Update as follows.

    * Drop the struct cgroup declaration. None of the dummy defs need it.

    * Define blkcg as an empty struct instead of just declaring it.

    * Wrap dummy function defs in CONFIG_BLOCK. Some functions use block
    data types and none of them are to be used w/o block enabled.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup aware writeback support will require exposing some of blkcg
    details. In preprataion, move block/blk-cgroup.h to
    include/linux/blk-cgroup.h. This patch is pure file move.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When modifying PG_Dirty on cached file pages, update the new
    MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
    global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
    per memcg memory.stat cgroupfs file. The most recent past attempt at
    this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

    The new accounting supports future efforts to add per cgroup dirty
    page throttling and writeback. It also helps an administrator break
    down a container's memory usage and provides evidence to understand
    memcg oom kills (the new dirty count is included in memcg oom kill
    messages).

    The ability to move page accounting between memcg
    (memory.move_charge_at_immigrate) makes this accounting more
    complicated than the global counter. The existing
    mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
    accounting with stat updates.
    Typical update operation:
    memcg = mem_cgroup_begin_page_stat(page)
    if (TestSetPageDirty()) {
    [...]
    mem_cgroup_update_page_stat(memcg)
    }
    mem_cgroup_end_page_stat(memcg)

    Summary of mem_cgroup_end_page_stat() overhead:
    - Without CONFIG_MEMCG it's a no-op
    - With CONFIG_MEMCG and no inter memcg task movement, it's just
    rcu_read_lock()
    - With CONFIG_MEMCG and inter memcg task movement, it's
    rcu_read_lock() + spin_lock_irqsave()

    A memcg parameter is added to several routines because their callers
    now grab mem_cgroup_begin_page_stat() which returns the memcg later
    needed by for mem_cgroup_update_page_stat().

    Because mem_cgroup_begin_page_stat() may disable interrupts, some
    adjustments are needed:
    - move __mark_inode_dirty() from __set_page_dirty() to its caller.
    __mark_inode_dirty() locking does not want interrupts disabled.
    - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
    __delete_from_page_cache(), replace_page_cache_page(),
    invalidate_complete_page2(), and __remove_mapping().

    text data bss dec hex filename
    8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
    8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
    +192 text bytes
    8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
    8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
    +773 text bytes

    Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
    all metrics, they're all wall clock or cycle counts. The read and write
    fault benchmarks just measure fault time, they do not include I/O time.

    * CONFIG_MEMCG not set:
    baseline patched
    kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
    dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
    dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
    dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
    read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
    write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

    * CONFIG_MEMCG=y root_memcg:
    baseline patched
    kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
    dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
    dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
    dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
    read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
    write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

    * CONFIG_MEMCG=y non-root_memcg:
    baseline patched
    kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
    dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
    dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
    dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
    read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
    write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

    As expected anon page faults are not affected by this patch.

    tj: Updated to apply on top of the recent cancel_dirty_page() changes.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Greg Thelen
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Greg Thelen
     
  • cancel_dirty_page() had some issues and b9ea25152e56 ("page_writeback:
    clean up mess around cancel_dirty_page()") replaced it with
    account_page_cleaned() which makes the caller responsible for clearing
    the dirty bit; unfortunately, the planned changes for cgroup writeback
    support requires synchronization between dirty bit manipulation and
    stat updates. While we can open-code such synchronization in each
    account_page_cleaned() callsite, that's gonna be unnecessarily awkward
    and verbose.

    This patch revives cancel_dirty_page() but in a more restricted form.
    All it does is TestClearPageDirty() followed by account_page_cleaned()
    invocation if the page was dirty. This helper covers all
    account_page_cleaned() usages except for __delete_from_page_cache()
    which is a special case anyway and left alone. As this leaves no
    module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
    from it.

    This patch just revives cancel_dirty_page() as a trivial wrapper to
    replace equivalent usages and doesn't introduce any functional
    changes.

    Signed-off-by: Tejun Heo
    Cc: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Storage controllers may expose multiple block devices that share hardware
    resources managed by blk-mq. This patch enhances the shared tags so a
    low-level driver can access the shared resources not tied to the unshared
    h/w contexts. This way the LLD can dynamically add and delete disks and
    request queues without having to track all the request_queue hctx's to
    iterate outstanding tags.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

30 May, 2015

2 commits


27 May, 2015

1 commit


22 May, 2015

2 commits

  • Currently dm-multipath has to clone the bios for every request sent
    to the lower devices, which wastes cpu cycles and ties down memory.

    This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
    to not complete bios attached to a request, which we set on clone
    requests similar to bios in a flush sequence. With this change I/O
    errors on a path failure only get propagated to dm-multipath, which
    can then either resubmit the I/O or complete the bios on the original
    request.

    I've done some basic testing of this on a Linux target with ALUA support,
    and it survives path failures during I/O nicely.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
    non-chains") regressed all existing callers that followed this pattern:
    1) saving a bio's original bi_end_io
    2) wiring up an intermediate bi_end_io
    3) restoring the original bi_end_io from intermediate bi_end_io
    4) calling bio_endio() to execute the restored original bi_end_io

    The regression was due to BIO_CHAIN only ever getting set if
    bio_inc_remaining() is called. For the above pattern it isn't set until
    step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
    the first bio_endio(), in step 2 above, never decremented __bi_remaining
    before calling the intermediate bi_end_io -- leaving __bi_remaining with
    the value 1 instead of 0. When bio_inc_remaining() occurred during step
    3 it brought it to a value of 2. When the second bio_endio() was
    called, in step 4 above, it should've called the original bi_end_io but
    it didn't because there was an extra reference that wasn't dropped (due
    to atomic operations being optimized away since BIO_CHAIN wasn't set
    upfront).

    Fix this issue by removing the __bi_remaining management complexity for
    all callers that use the above pattern -- bio_chain() is the only
    interface that _needs_ to be concerned with __bi_remaining. For the
    above pattern callers just expect the bi_end_io they set to get called!
    Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
    that aren't associated with the bio_chain() interface.

    Also, the bio_inc_remaining() interface has been moved local to bio.c.

    Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

20 May, 2015

2 commits

  • The only possible problem of using mutex_lock() instead of trylock
    is about deadlock.

    If there aren't any locks held before calling blkdev_reread_part(),
    deadlock can't be caused by this conversion.

    If there are locks held before calling blkdev_reread_part(),
    and if these locks arn't required in open, close handler and I/O
    path, deadlock shouldn't be caused too.

    Both user space's ioctl(BLKRRPART) and md_setup_drive() from
    init/do_mounts_md.c belongs to the 1st case, so the conversion is safe
    for the two cases.

    For loop, the previous patches in this pathset has fixed the ABBA lock
    dependency, so the conversion is OK.

    For nbd, tx_lock is held when calling the function:

    - both open and release won't hold the lock
    - when blkdev_reread_part() is run, I/O thread has been stopped
    already, so tx_lock won't be acquired in I/O path at that time.
    - so the conversion won't cause deadlock for nbd

    For dasd, both dasd_open(), dasd_release() and request function don't
    acquire any mutex/semphone, so the conversion should be safe.

    Reviewed-by: Christoph Hellwig
    Tested-by: Jarod Wilson
    Acked-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch exports blkdev_reread_part() for block drivers, also
    introduce __blkdev_reread_part().

    For some drivers, such as loop, reread of partitions can be run
    from the release path, and bd_mutex may already be held prior to
    calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
    __blkdev_reread_part for use in such cases.

    CC: Christoph Hellwig
    CC: Jens Axboe
    CC: Tejun Heo
    CC: Alexander Viro
    CC: Markus Pargmann
    CC: Stefan Weinhuber
    CC: Stefan Haberland
    CC: Sebastian Ott
    CC: Fabian Frederick
    CC: Ming Lei
    CC: David Herrmann
    CC: Andrew Morton
    CC: Peter Zijlstra
    CC: nbd-general@lists.sourceforge.net
    CC: linux-s390@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jarod Wilson
     

19 May, 2015

5 commits


09 May, 2015

6 commits

  • Last patch makes plug work for multiple queue case. However it only
    works for single disk case, because it assumes only one request in the
    plug list. If a task is accessing multiple disks, eg MD/DM, the
    assumption is wrong. Let blk_attempt_plug_merge() record request from
    the same queue.

    V2: use NULL parameter in !mq case. Fix a bug. Add comments in
    blk_attempt_plug_merge to make it less (hopefully) confusion.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • plug is still helpful for workload with IO merge, but it can be harmful
    otherwise especially with multiple hardware queues, as there is
    (supposed) no lock contention in this case and plug can introduce
    latency. For multiple queues, we do limited plug, eg plug only if there
    is request merge. If a request doesn't have merge with following
    request, the requet will be dispatched immediately.

    V2: check blk_queue_nomerges() as suggested by Jeff.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • If we directly issue a request and it fails, we use
    blk_mq_merge_queue_io(). But we already assigned bio to a request in
    blk_mq_bio_to_request. blk_mq_merge_queue_io shouldn't run
    blk_mq_bio_to_request again.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • The following appears in blk_sq_make_request:

    /*
    * If we have multiple hardware queues, just go directly to
    * one of those for sync IO.
    */

    We clearly don't have multiple hardware queues, here! This comment was
    introduced with this commit 07068d5b8e (blk-mq: split make request
    handler for multi and single queue):

    We want slightly different behavior from them:

    - On single queue devices, we currently use the per-process plug
    for deferred IO and for merging.

    - On multi queue devices, we don't use the per-process plug, but
    we want to go straight to hardware for SYNC IO.

    The old code had this:

    use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

    and that was converted to:

    use_plug = !is_flush_fua && !is_sync;

    which is not equivalent. For the single queue case, that second half of
    the && expression is always true. So, what I think was actually inteded
    follows (and this more closely matches what is done in blk_queue_bio).

    V2: delete the 'likely', which should not be a big deal

    Signed-off-by: Jeff Moyer
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • block plug callback could sleep, so we introduce a parameter
    'from_schedule' and corresponding drivers can use it to destinguish a
    schedule plug flush or a plug finish. Unfortunately io_schedule_out
    still uses blk_flush_plug(). This causes below output (Note, I added a
    might_sleep() in raid1_unplug to make it trigger faster, but the whole
    thing doesn't matter if I add might_sleep). In raid1/10, this can cause
    deadlock.

    This patch makes io_schedule_out always uses blk_schedule_flush_plug.
    This should only impact drivers (as far as I know, raid 1/10) which are
    sensitive to the 'from_schedule' parameter.

    [ 370.817949] ------------[ cut here ]------------
    [ 370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
    [ 370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [] prepare_to_wait+0x2f/0x90
    [ 370.817971] Modules linked in: raid1
    [ 370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G W 4.0.0+ #361
    [ 370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
    [ 370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
    [ 370.817985] ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
    [ 370.817988] ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
    [ 370.817990] ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
    [ 370.817993] Call Trace:
    [ 370.817999] [] dump_stack+0x4f/0x7b
    [ 370.818002] [] warn_slowpath_common+0x8c/0xd0
    [ 370.818004] [] warn_slowpath_fmt+0x46/0x50
    [ 370.818006] [] ? prepare_to_wait+0x2f/0x90
    [ 370.818008] [] ? prepare_to_wait+0x2f/0x90
    [ 370.818010] [] __might_sleep+0x7f/0x90
    [ 370.818014] [] raid1_unplug+0xd3/0x170 [raid1]
    [ 370.818024] [] blk_flush_plug_list+0x8a/0x1e0
    [ 370.818028] [] ? bit_wait+0x50/0x50
    [ 370.818031] [] io_schedule_timeout+0x130/0x140
    [ 370.818033] [] bit_wait_io+0x36/0x50
    [ 370.818034] [] __wait_on_bit+0x65/0x90
    [ 370.818041] [] ? ext4_read_block_bitmap_nowait+0xbc/0x630
    [ 370.818043] [] ? bit_wait+0x50/0x50
    [ 370.818045] [] out_of_line_wait_on_bit+0x72/0x80
    [ 370.818047] [] ? autoremove_wake_function+0x40/0x40
    [ 370.818050] [] __wait_on_buffer+0x44/0x50
    [ 370.818053] [] ext4_wait_block_bitmap+0xe0/0xf0
    [ 370.818058] [] ext4_mb_init_cache+0x206/0x790
    [ 370.818062] [] ? lru_cache_add+0x1c/0x50
    [ 370.818064] [] ext4_mb_init_group+0x11e/0x200
    [ 370.818066] [] ext4_mb_load_buddy+0x341/0x360
    [ 370.818068] [] ext4_mb_find_by_goal+0x93/0x2f0
    [ 370.818070] [] ? ext4_mb_normalize_request+0x1e4/0x5b0
    [ 370.818072] [] ext4_mb_regular_allocator+0x67/0x460
    [ 370.818074] [] ? ext4_mb_normalize_request+0x1e4/0x5b0
    [ 370.818076] [] ext4_mb_new_blocks+0x4cb/0x620
    [ 370.818079] [] ext4_ext_map_blocks+0x4c6/0x14d0
    [ 370.818081] [] ? ext4_es_lookup_extent+0x4e/0x290
    [ 370.818085] [] ext4_map_blocks+0x14d/0x4f0
    [ 370.818088] [] ext4_writepages+0x76d/0xe50
    [ 370.818094] [] do_writepages+0x21/0x50
    [ 370.818097] [] __writeback_single_inode+0x60/0x490
    [ 370.818099] [] writeback_sb_inodes+0x2da/0x590
    [ 370.818103] [] ? trylock_super+0x1b/0x50
    [ 370.818105] [] ? trylock_super+0x1b/0x50
    [ 370.818107] [] __writeback_inodes_wb+0x9f/0xd0
    [ 370.818109] [] wb_writeback+0x34b/0x3c0
    [ 370.818111] [] bdi_writeback_workfn+0x23f/0x550
    [ 370.818116] [] process_one_work+0x1c8/0x570
    [ 370.818117] [] ? process_one_work+0x14b/0x570
    [ 370.818119] [] worker_thread+0x11b/0x470
    [ 370.818121] [] ? process_one_work+0x570/0x570
    [ 370.818124] [] kthread+0xf8/0x110
    [ 370.818126] [] ? kthread_create_on_node+0x210/0x210
    [ 370.818129] [] ret_from_fork+0x42/0x70
    [ 370.818131] [] ? kthread_create_on_node+0x210/0x210
    [ 370.818132] ---[ end trace 7b4deb71e68b6605 ]---

    V2: don't change ->in_iowait

    Cc: NeilBrown
    Signed-off-by: Shaohua Li
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Current code looks like inner plug gets flushed with a
    blk_finish_plug(). Actually it's a nop. All requests/callbacks are added
    to current->plug, while only outmost plug is assigned to current->plug.
    So inner plug always has empty request/callback list, which makes
    blk_flush_plug_list() a nop. This tries to make the code more clear.

    Signed-off-by: Shaohua Li
    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Shaohua Li
     

06 May, 2015

5 commits