17 Jul, 2020

2 commits

  • The reverse-order double lock dance in ioc_release_fn() uses a
    retry loop. This is a problem on PREEMPT_RT because the task that
    would release q->queue_lock can be preempted, causing the retry
    loop to livelock.

    RCU is already managing the freeing of the request queue and icq. If
    the trylock fails, use RCU to guarantee that the request queue and
    icq are not freed and re-acquire the locks in the correct order,
    allowing forward progress.

    Signed-off-by: John Ogness
    Reviewed-by: Daniel Wagner
    Signed-off-by: Jens Axboe

    John Ogness
     
  • The legacy CFQ IO scheduler could call put_io_context() in its exit_icq()
    elevator callback. This led to a lockdep warning, which was fixed in
    commit d8c66c5d5924 ("block: fix lockdep warning on io_context release
    put_io_context()") by using a nested subclass for the ioc spinlock.
    However, with commit f382fb0bcef4 ("block: remove legacy IO schedulers")
    the CFQ IO scheduler no longer exists.

    The BFQ IO scheduler also implements the exit_icq() elevator callback but
    does not call put_io_context().

    The nested subclass for the ioc spinlock is no longer needed. Since it
    existed as an exception and no longer applies, remove the nested subclass
    usage.
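Concretely, the change is presumably the removal of the subclass-1 annotation in ioc_release_fn() (a reconstructed sketch; the exact call site and flags variable may differ):

```
/* Before: subclass 1 avoided a false A-A lockdep report when CFQ's
 * exit_icq() recursed into another ioc's lock via put_io_context(). */
spin_lock_irqsave_nested(&ioc->lock, flags, 1);

/* After: plain locking suffices, since no exit_icq() implementation
 * calls put_io_context() any more. */
spin_lock_irqsave(&ioc->lock, flags);
```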

    Signed-off-by: John Ogness
    Reviewed-by: Daniel Wagner
    Signed-off-by: Jens Axboe

    John Ogness
     

12 Mar, 2020

1 commit

  • There is a potential race between ioc_release_fn() and
    ioc_clear_queue(), shown below, which leads to the kernel crash
    observed. It can also result in a use-after-free.

    context#1: ioc_release_fn()              context#2: __ioc_clear_queue()
                                             (gets the same icq)
      ->spin_lock(&ioc->lock);                 ->spin_lock(&ioc->lock);
      ->ioc_destroy_icq(icq);
        ->list_del_init(&icq->q_node);
        ->call_rcu(&icq->__rcu_head,
                   icq_free_icq_rcu);
      ->spin_unlock(&ioc->lock);
                                               ->ioc_destroy_icq(icq);
                                                 ->hlist_del_init(&icq->ioc_node);

    The hlist_del_init() in context#2 results in the crash below, as that
    memory is now used by icq->__rcu_head in context#1. There is also a
    chance that the icq itself has already been freed.

    22150.386550: Unable to handle kernel write to read-only memory
    at virtual address ffffffaa8d31ca50
    ...
    Call trace:
    22150.607350: ioc_destroy_icq+0x44/0x110
    22150.611202: ioc_clear_queue+0xac/0x148
    22150.615056: blk_cleanup_queue+0x11c/0x1a0
    22150.619174: __scsi_remove_device+0xdc/0x128
    22150.623465: scsi_forget_host+0x2c/0x78
    22150.627315: scsi_remove_host+0x7c/0x2a0
    22150.631257: usb_stor_disconnect+0x74/0xc8
    22150.635371: usb_unbind_interface+0xc8/0x278
    22150.639665: device_release_driver_internal+0x198/0x250
    22150.644897: device_release_driver+0x24/0x30
    22150.649176: bus_remove_device+0xec/0x140
    22150.653204: device_del+0x270/0x460
    22150.656712: usb_disable_device+0x120/0x390
    22150.660918: usb_disconnect+0xf4/0x2e0
    22150.664684: hub_event+0xd70/0x17e8
    22150.668197: process_one_work+0x210/0x480
    22150.672222: worker_thread+0x32c/0x4c8

    Fix this by adding a new ICQ_DESTROYED flag, set in ioc_destroy_icq(),
    to mark an icq that has already been destroyed. Also, make
    __ioc_clear_queue() access the icq inside rcu_read_lock()/rcu_read_unlock()
    so that the icq is not freed while it is still in use.

    Signed-off-by: Sahitya Tummala
    Co-developed-by: Pradeep P V K
    Signed-off-by: Pradeep P V K
    Signed-off-by: Jens Axboe

    Sahitya Tummala
     

16 Nov, 2018

2 commits


10 Nov, 2018

1 commit

  • Fixes gcc '-Wunused-but-set-variable' warning:

    block/blk-ioc.c: In function 'put_io_context_active':
    block/blk-ioc.c:174:24: warning:
    variable 'et' set but not used [-Wunused-but-set-variable]

    The variable is not used any more after commit
    a1ce35fa4985 ("block: remove dead elevator code").

    Signed-off-by: YueHaibing
    Signed-off-by: Jens Axboe

    YueHaibing
     

08 Nov, 2018

2 commits

  • This is a remnant of when we had ops for both SQ and MQ
    schedulers. Now it's just MQ, so get rid of the union.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This removes a bunch of core and elevator related code. On the core
    front, we remove anything related to queue running, draining,
    initialization, plugging, and congestions. We also kill anything
    related to request allocation, merging, retrieval, and completion.

    Remove any checking for single queue IO schedulers, as they no
    longer exist. This means we can also delete a bunch of code related
    to request issue, adding, completion, etc - and all the SQ related
    ops and helpers.

    Also kill the load_default_modules(), as all that did was provide
    for a way to load the default single queue elevator.

    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Jul, 2018

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
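As an example, the identifier is a single comment line at the top of the file (placement and comment-style conventions vary by file type; this is the form for a C source file):

```
// SPDX-License-Identifier: GPL-2.0
```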

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Mar, 2017

1 commit

  • Pull block layer fixes from Jens Axboe:
    "A collection of fixes for this merge window, either fixes for existing
    issues, or parts that were waiting for acks to come in. This pull
    request contains:

    - Allocation of nvme queues on the right node from Shaohua.

    This was ready long before the merge window, but waiting on an ack
    from Bjorn on the PCI bit. Now that we have that, the three patches
    can go in.

    - Two fixes for blk-mq-sched with nvmeof, which uses hctx specific
    request allocations. This caused an oops. One part from Sagi, one
    part from Omar.

    - A loop partition scan deadlock fix from Omar, fixing a regression
    in this merge window.

    - A three-patch series from Keith, closing up a hole on clearing out
    requests on shutdown/resume.

    - A stable fix for nbd from Josef, fixing a leak of sockets.

    - Two fixes for a regression in this window from Jan, fixing a
    problem with one of his earlier patches dealing with queue vs bdi
    life times.

    - A fix for a regression with virtio-blk, causing an IO stall if
    scheduling is used. From me.

    - A fix for an io context lock ordering problem. From me"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: Move bdi_unregister() to del_gendisk()
    blk-mq: ensure that bd->last is always set correctly
    block: don't call ioc_exit_icq() with the queue lock held for blk-mq
    block: Initialize bd_bdi on inode initialization
    loop: fix LO_FLAGS_PARTSCAN hang
    nvme: Complete all stuck requests
    blk-mq: Provide freeze queue timeout
    blk-mq: Export blk_mq_freeze_queue_wait
    nbd: stop leaking sockets
    blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
    blk-mq: kill blk_mq_set_alloc_data()
    blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
    blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
    nvme: allocate nvme_queue in correct node
    PCI: add an API to get node from vector
    blk-mq: allocate blk_mq_tags and requests in correct node

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • For legacy scheduling, we always call ioc_exit_icq() with both the
    ioc and queue lock held. This poses a problem for blk-mq with
    scheduling, since the queue lock isn't what we use in the scheduler.
    And since we don't need the queue lock held for ioc exit there,
    don't grab it and leave any extra locking up to the blk-mq scheduler.

    Reported-by: Paolo Valente
    Tested-by: Paolo Valente
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Mar, 2017

1 commit


11 Feb, 2017

1 commit


18 Jan, 2017

2 commits

  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interface, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     
  • Prep patch for adding MQ ops as well, since doing anon unions with
    named initializers doesn't work on older compilers.

    Signed-off-by: Jens Axboe
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep, and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

19 Feb, 2014

1 commit

  • (Trivial patch.)

    If the code is looking at the RCU-protected pointer itself, but not
    dereferencing it, the rcu_dereference() functions can be downgraded to
    rcu_access_pointer(). This commit makes this downgrade in blkg_destroy()
    and ioc_destroy_icq(), both of which simply compare the RCU-protected
    pointer against another pointer with no dereferencing.

    Signed-off-by: Paul E. McKenney
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Paul E. McKenney
     

09 Nov, 2013

1 commit


12 Sep, 2013

1 commit

  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    process context:                     interrupt context:
    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];
          <interrupt>
                                         ...
                                         radix_tree_preload()
                                         ...
                                         radix_tree_insert()
                                           radix_tree_node_alloc()
                                             if (rtp->nr) {
                                               ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption with different results (usually OOPS) depending on which
    two users of radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    in_interrupt() check is somewhat ugly but we cannot simply key off passed
    gfp_mask as that is acquired from root_gfp_mask() and thus the same for
    all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
    preallocation in such case doesn't make sense and when preallocation would
    happen in interrupt we could possibly leak some allocated nodes. However,
    some users of radix_tree_preload() require following radix_tree_insert()
    to succeed. To avoid unexpected effects for these users,
    radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
    and we provide a new function radix_tree_maybe_preload() for those users
    which get different gfp mask from different call sites and which are
    prepared to handle radix_tree_insert() failure.

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

15 May, 2013

1 commit

  • The block layer uses workqueues for multiple purposes. There is no real
    dependency on scheduling these on the cpu which scheduled them.

    On an idle system, it is observed that an idle cpu wakes up many times just
    to service this work. It would be better if we could schedule it on a cpu
    which the scheduler believes to be the most appropriate one.

    This patch replaces normal workqueues with power efficient versions.

    Cc: Jens Axboe
    Signed-off-by: Viresh Kumar
    Signed-off-by: Tejun Heo

    Viresh Kumar
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

01 Aug, 2012

1 commit

  • Hi,

    I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
    below warning as of 3.5-rc1 and later (3.4 is fine):

    [ 10.886893] ------------[ cut here ]------------
    [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
    [ 10.886905] Hardware name: Bochs
    [ 10.886906] Modules linked in:
    [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
    [ 10.886908] Call Trace:
    [ 10.886911] [] warn_slowpath_common+0x7a/0xb0
    [ 10.886912] [] warn_slowpath_null+0x15/0x20
    [ 10.886913] [] copy_process+0x1488/0x1560
    [ 10.886914] [] do_fork+0xb4/0x340
    [ 10.886918] [] ? recalc_sigpending+0x1a/0x50
    [ 10.886919] [] ? __set_task_blocked+0x32/0x80
    [ 10.886920] [] ? __set_current_blocked+0x3a/0x60
    [ 10.886923] [] sys_clone+0x23/0x30
    [ 10.886925] [] stub_clone+0x13/0x20
    [ 10.886927] [] ? system_call_fastpath+0x16/0x1b
    [ 10.886928] ---[ end trace 32a14af7ee6a590b ]---

    Reproducing is easy, I can hit it on a KVM system with a very basic
    config (x86_64 make defconfig + enable the drivers needed). To hit it,
    just install dump (on debian/ubuntu, not sure what the package might be
    called on Fedora), and:

    dump -o -f /tmp/foo /

    You'll see the warning in dmesg once it forks off the I/O process and
    starts dumping filesystem contents.

    I bisected it down to the following commit:

    commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
    Author: Tejun Heo
    Date: Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks. This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task. This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    It seems like the init of ioc->nr_tasks was removed in that patch,
    so it starts out at 0 instead of 1.

    Tejun, is the right thing here to add back the init, or should something else
    be done?

    The below patch removes the warning, but I haven't done any more extensive
    testing on it.

    Signed-off-by: Olof Johansson
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Olof Johansson
     

31 May, 2012

1 commit

  • Calling get_task_io_context() on an exiting task which isn't %current can
    loop forever. This triggers at boot time on my dev machine.

    BUG: soft lockup - CPU#3 stuck for 22s ! [mountall.1603]

    Fix this by making create_task_io_context() returns -EBUSY in this case
    to break the loop.

    Signed-off-by: Eric Dumazet
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Alan Cox
    Signed-off-by: Jens Axboe

    Eric Dumazet
     

02 Apr, 2012

1 commit

  • cgroup/for-3.5 contains the following changes which blk-cgroup needs
    to proceed with the on-going cleanup.

    * Dynamic addition and removal of cftypes to make config/stat file
    handling modular for policies.

    * cgroup removal update to not wait for css references to drain to fix
    blkcg removal hang caused by cfq caching cfqgs.

    Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the
    following conflicts in block/blk-cgroup.c.

    * 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
    conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
    removal. Resolved by removing @subsys from all subsys methods.

    * 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
    controllers" conflicts with ->pre_destroy() and ->attach() updates
    and removal of modular config. Resolved by dropping forward
    declarations of the methods and applying updates to the relocated
    blkio_subsys.

    * 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
    cftype interface" builds upon the previous item. Resolved by adding
    ->base_cftypes to the relocated blkio_subsys.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

20 Mar, 2012

1 commit

  • After the previous patch to cfq, there's no ioc_get_changed() user
    left. This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all
    related stuff.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Mar, 2012

1 commit

  • When put_io_context() is called, if ioc->icq_list is empty and the
    refcount is 1, the kernel will not free the ioc.

    This is caught by following kmemleak:

    unreferenced object 0xffff880036349fe0 (size 216):
    comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
    backtrace:
    [] kmemleak_alloc+0x26/0x50
    [] kmem_cache_alloc_node+0x1cc/0x2a0
    [] create_io_context_slowpath+0x27/0x130
    [] get_task_io_context+0xbb/0xf0
    [] copy_process+0x188e/0x18b0
    [] do_fork+0x11b/0x420
    [] sys_clone+0x28/0x30
    [] stub_clone+0x13/0x20
    [] 0xffffffffffffffff

    ioc should be freed if ioc->icq_list is empty.
    Signed-off-by: Xiaotian Feng
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Xiaotian Feng
     

07 Mar, 2012

2 commits

  • Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks. This
    patch separates out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task. This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Make the following interface updates to prepare for future ioc related
    changes.

    * create_io_context() returning ioc only works for %current because it
    doesn't increment ref on the ioc. Drop @task parameter from it and
    always assume %current.

    * Make create_io_context_slowpath() return 0 or -errno and rename it
    to create_task_io_context().

    * Make ioc_create_icq() take @ioc as parameter instead of assuming
    that of %current. The caller, get_request(), is updated to create
    ioc explicitly and then pass it into ioc_create_icq().
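The third change presumably amounts to a signature update along these lines (reconstructed for illustration, not taken verbatim from the patch):

```
/* before: implicitly used current->io_context */
static struct io_cq *ioc_create_icq(struct request_queue *q, gfp_t gfp_mask);

/* after: the caller supplies the ioc it has already created/referenced */
static struct io_cq *ioc_create_icq(struct io_context *ioc,
                                    struct request_queue *q, gfp_t gfp_mask);
```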

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Feb, 2012

3 commits

  • While updating locking, b2efa05265 "block, cfq: unlink
    cfq_io_context's immediately" moved the elevator_exit_icq_fn() invocation
    from exit_io_context() to the final ioc put. While this doesn't cause
    catastrophic failure, it effectively removes task exit notification to
    the elevator and causes noticeable IO performance degradation with CFQ.
    On task exit, CFQ used to immediately expire the slice if it was being
    used by the exiting task as no more IO would be issued by the task;
    however, after b2efa05265, the notification is lost and disk could sit
    idle needlessly, leading to noticeable IO performance degradation for
    certain workloads.

    This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates
    elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it
    from exit_io_context(). ICQ_EXITED flag is added to avoid invoking
    the callback more than once for the same icq.

    Walking icq_list from the ioc side and invoking the elevator callback
    requires reverse double locking. This may be better implemented using
    RCU; unfortunately, using RCU isn't trivial: e.g. RCU protection would
    need to cover the request_queue, and the queue_lock switch on cleanup
    makes grabbing queue_lock under RCU unsafe. Reverse double locking
    will do, at least for now.

    Signed-off-by: Tejun Heo
    Reported-and-bisected-by: Shaohua Li
    LKML-Reference:
    Tested-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The reverse double lock dancing in ioc_release_fn() can be simplified by
    just using trylock on the queue_lock and backing out from the ioc lock on
    trylock failure. Simplify it.

    Signed-off-by: Tejun Heo
    Tested-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • icq->changed was used for ICQ_*_CHANGED bits. Rename it to flags and
    access it under ioc->lock instead of using atomic bitops.
    ioc_get_changed() is added so that the changed part can be fetched and
    cleared as before.

    icq->flags will be used to carry other flags.

    Signed-off-by: Tejun Heo
    Tested-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Feb, 2012

1 commit

  • 11a3122f6c "block: strip out locking optimization in put_io_context()"
    removed the ioc_lock depth lockdep annotation along with the locking
    optimization; however, while recursing from put_io_context() is no
    longer possible, ioc_release_fn() may still end up putting the last
    reference of another ioc through the elevator, which will grab ioc->lock,
    triggering a spurious (the ioc is always a different one) A-A deadlock
    warning.

    As this can only happen one time from ioc_release_fn(), using non-zero
    subclass from ioc_release_fn() is enough. Use subclass 1.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Feb, 2012

1 commit

  • put_io_context() performed a complex trylock dance to avoid
    deferring ioc release to a workqueue. It was also broken on UP because
    the trylock was always assumed to succeed, which resulted in an
    unbalanced preemption count.

    While there are ways to fix the UP breakage, even the most
    pathological microbench (forced ioc allocation and tight fork/exit
    loop) fails to show any appreciable performance benefit of the
    optimization. Strip it out. If there turns out to be workloads which
    are affected by this change, simpler optimization from the discussion
    thread can be applied later.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 Feb, 2012

1 commit

  • Meelis reported a warning:

    WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
    Hardware name: 939Dual-SATA2
    timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
    Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
    Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd125666 #176
    Call Trace:
    [] warn_slowpath_common+0x7e/0x96
    [] ? cfq_slice_expired+0x1d/0x1d
    [] warn_slowpath_fmt+0x41/0x43
    [] ? cfq_idle_slice_timer+0xa1/0xaa
    [] ? cfq_slice_expired+0x1d/0x1d
    [] run_timer_softirq+0x199/0x1ec
    [] ? timekeeping_get_ns+0x12/0x31
    [] ? apic_write+0x11/0x13
    [] __do_softirq+0x74/0xfa
    [] call_softirq+0x1a/0x30
    [] do_softirq+0x31/0x68
    [] irq_exit+0x3d/0xa3
    [] smp_apic_timer_interrupt+0x6b/0x77
    [] apic_timer_interrupt+0x69/0x70
    [] ? sched_clock_cpu+0x73/0x7d
    [] ? sched_clock_cpu+0x73/0x7d
    [] ? default_idle+0x1e/0x32
    [] ? default_idle+0x18/0x32
    [] cpu_idle+0x87/0xd1
    [] rest_init+0x85/0x89
    [] start_kernel+0x2eb/0x2f8
    [] x86_64_start_reservations+0x7e/0x82
    [] x86_64_start_kernel+0xf0/0xf7

    this_q == locked_q is possible. There are two problems here:
    1. In the UP case, there is a preemption counter issue, as
    spin_trylock() always succeeds.
    2. In the SMP case, the loop breaks too early.

    Signed-off-by: Shaohua Li
    Reported-by: Meelis Roos
    Reported-by: Knut Petersen
    Tested-by: Knut Petersen
    Signed-off-by: Jens Axboe

    Shaohua Li
     

28 Dec, 2011

1 commit

  • 6e736be7 "block: make ioc get/put interface more conventional and fix
    race on alloction" added a WARN_ON_ONCE() in exit_io_context() which
    triggers if !PF_EXITING. All tasks hitting exit_io_context() from
    task exit should have PF_EXITING set, but task struct teardown
    after fork failure calls into the function without PF_EXITING,
    triggering the condition.

    WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
    Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77
    Call Trace:
    [] warn_slowpath_common+0x8f/0xb2
    [] warn_slowpath_null+0x18/0x1a
    [] exit_io_context+0x40/0x92
    [] copy_process+0x126f/0x1453
    [] do_fork+0x120/0x3e9
    [] sys_clone+0x26/0x28
    [] stub_clone+0x13/0x20
    ---[ end trace a2e4eb670b375238 ]---
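
    The triggering condition can be modelled in isolation (a hypothetical
    userspace sketch, not the kernel code): PF_EXITING is set only on the
    normal task-exit path, so the fork-failure teardown reaches the check
    with the flag clear and the warning fires.

    ```c
    #include <assert.h>

    /* Hypothetical model of the condition above.  PF_EXITING is set by
     * the normal exit path before exit_io_context() runs, but the
     * fork-failure teardown path never sets it, so a bare
     * WARN_ON_ONCE(!(flags & PF_EXITING)) fires there.  The flag value
     * and names mirror the kernel's only loosely. */

    #define PF_EXITING 0x00000004u

    struct task { unsigned int flags; };

    /* Returns 1 when the warning would fire. */
    static int would_warn(const struct task *t)
    {
        return !(t->flags & PF_EXITING);
    }

    int main(void)
    {
        struct task exiting   = { .flags = PF_EXITING }; /* normal exit  */
        struct task fork_fail = { .flags = 0 };          /* fork failure */

        assert(!would_warn(&exiting));
        assert(would_warn(&fork_fail));
        return 0;
    }
    ```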

    Reported-by: Sasha Levin
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Dec, 2011

1 commit

  • While fixing an io_context creation / task exit race condition,
    6e736be7f2 "block: make ioc get/put interface more conventional and
    fix race on alloction" also prevented an exiting (%PF_EXITING) task
    from creating its own io_context. This is incorrect, as the exit path
    may issue IOs, e.g. from exit_files(), and if those IOs are the first
    ones issued by the task, an io_context needs to be created to process
    them.

    Combined with the existing problem of io_context / io_cq creation
    failure having the possibility of stalling IO, this problem results in
    deterministic full IO lockup with certain workloads.

    Fix it by allowing io_context creation regardless of %PF_EXITING for
    %current.
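
    The relaxed guard can be sketched as follows (a hypothetical
    userspace model with illustrative names and flag value, not the
    block layer's code): creation is refused only when the target task
    is exiting and is not %current.

    ```c
    #include <assert.h>

    /* Hypothetical sketch of the fix described above: io_context
     * creation used to be refused for any PF_EXITING task; it is now
     * refused only when creating on behalf of *another* exiting task,
     * since %current may legitimately issue its first IOs from the exit
     * path.  Names and the flag value are illustrative. */

    #define PF_EXITING 0x00000004u

    struct task { unsigned int flags; };

    /* 1 = creation allowed, 0 = refused */
    static int may_create_ioc(const struct task *task,
                              const struct task *current_task)
    {
        if ((task->flags & PF_EXITING) && task != current_task)
            return 0;   /* another task is exiting: don't create for it */
        return 1;       /* %current may create even while exiting */
    }

    int main(void)
    {
        struct task me    = { .flags = PF_EXITING };
        struct task other = { .flags = PF_EXITING };

        assert(may_create_ioc(&me, &me));      /* exiting %current: ok  */
        assert(!may_create_ioc(&other, &me));  /* exiting other: refused */
        return 0;
    }
    ```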

    Signed-off-by: Tejun Heo
    Reported-by: Andrew Morton
    Reported-by: Hugh Dickins
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 Dec, 2011

1 commit


14 Dec, 2011

2 commits

  • Now block layer knows everything necessary to create and associate
    icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
    get_request() such that, if elevator_type->icq_size is set, requests
    are automatically associated with their matching icq's before
    elv_set_request(). io_context reference is also managed by block core
    on request alloc/free.

    * Only the ioprio/cgroup change handling remains from cfq_get_cic();
    it is collapsed into cfq_set_request().

    * This removes queue kicking on icq allocation failure (for now). As
    icq allocation failure is rare and the only effect queue kicking
    achieved was possibly accelerating queue processing, this change
    shouldn't be noticeable.

    There is a larger underlying problem. Unlike request allocation,
    icq allocation is not guaranteed to succeed eventually after
    retries. The number of icqs is unbounded, so a mempool can't be the
    solution either. This effectively adds an allocation dependency on
    the memory free path and thus the possibility of deadlock.

    This usually wouldn't happen because icq allocation is not a hot
    path and, even when the condition triggers, it's highly unlikely
    that none of the writeback workers already has an icq.

    However, this is still possible especially if elevator is being
    switched under high memory pressure, so we better get it fixed.
    Probably the only solution is just bypassing elevator and appending
    to dispatch queue on any elevator allocation failure.

    * Comment added to explain how icq's are managed and synchronized.

    This completes cleanup of io_context interface.
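
    The allocation-time association described above can be sketched as
    follows (hypothetical types and helper names, not the block layer's
    actual API): when the elevator type declares a non-zero icq_size,
    the core attaches an icq to the request before the elevator's
    set_request hook runs.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical sketch of the get_request() behaviour described
     * above: with elevator_type->icq_size set, the block core looks up
     * or creates the matching icq and attaches it to the request before
     * calling the elevator's set_request hook.  All names here are
     * illustrative stand-ins. */

    struct icq { int dummy; };
    struct elevator_type { size_t icq_size; };
    struct request { struct icq *icq; };

    static struct icq *ioc_lookup_or_create_icq(void)
    {
        /* stand-in for an ioc_lookup_icq()/ioc_create_icq() pair */
        return calloc(1, sizeof(struct icq));
    }

    static int get_request(struct request *rq, const struct elevator_type *et)
    {
        rq->icq = NULL;
        if (et->icq_size) {
            rq->icq = ioc_lookup_or_create_icq();
            if (!rq->icq)
                return -1;   /* allocation failure: fail the request */
        }
        /* elv_set_request(rq) would run here, with rq->icq in place */
        return 0;
    }

    int main(void)
    {
        struct elevator_type with_icq = { .icq_size = sizeof(struct icq) };
        struct elevator_type no_icq = { .icq_size = 0 };
        struct request rq;

        assert(get_request(&rq, &with_icq) == 0 && rq.icq != NULL);
        free(rq.icq);
        assert(get_request(&rq, &no_icq) == 0 && rq.icq == NULL);
        return 0;
    }
    ```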

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the kmem_cache managed by blk-ioc, io_cq exit/release can be
    moved to blk-ioc too. The odd ->io_cq->exit/release() callbacks are
    replaced with elevator_ops->elevator_exit_icq_fn(); unlinking from
    both the ioc and q, and freeing, are handled automatically by
    blk-ioc. The elevator operation only needs to perform the exit work
    specific to the elevator - in cfq's case, exiting the cfqqs.

    Also, clearing of io_cq's on q detach is moved to block core and
    automatically performed on elevator switch and q release.

    Because the q an io_cq points to might be freed before the RCU
    callback for the io_cq runs, the blk-ioc code should remember which
    cache the io_cq needs to be freed back to when the io_cq is
    released. A new field, io_cq->__rcu_icq_cache, is added for this
    purpose. As both the new field and the rcu_head are used only after
    the io_cq is released, while the q_node/ioc_node fields aren't, the
    pairs are put into unions.
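
    The layout can be sketched like this (stub types standing in for the
    kernel's list_head/hlist_node/rcu_head/kmem_cache; field names follow
    the commit, everything else is illustrative): the live-phase linkage
    and the release-phase bookkeeping never overlap in time, so each pair
    can share storage.

    ```c
    #include <assert.h>

    /* Hypothetical sketch of the union layout described above.  The
     * list linkage (q_node/ioc_node) is only used while the io_cq is
     * live; the RCU head and the remembered slab cache are only used
     * after release.  Since the two lifetimes never overlap, each pair
     * shares storage in a union.  The *_stub types stand in for the
     * kernel's real ones. */

    struct list_node { struct list_node *next, *prev; };
    struct rcu_head_stub { void (*func)(void *); };
    struct kmem_cache_stub { const char *name; };

    struct io_cq {
        union {
            struct list_node q_node;              /* live: on q's list   */
            struct kmem_cache_stub *__rcu_icq_cache; /* released: cache  */
        };
        union {
            struct list_node ioc_node;            /* live: on ioc's list */
            struct rcu_head_stub __rcu_head;      /* released: RCU head  */
        };
    };

    int main(void)
    {
        /* The union halves overlap, so the struct is no bigger than the
         * two live-phase list nodes alone. */
        assert(sizeof(struct io_cq) == 2 * sizeof(struct list_node));
        return 0;
    }
    ```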

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo