16 Nov, 2018
2 commits
-
With the legacy request path gone there is no good reason to keep
queue_lock as a pointer, we can always use the embedded lock now.Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph HellwigFixed floppy and blk-cgroup missing conversions and half done edits.
Signed-off-by: Jens Axboe
-
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
10 Nov, 2018
1 commit
-
Fixes gcc '-Wunused-but-set-variable' warning:
block/blk-ioc.c: In function 'put_io_context_active':
block/blk-ioc.c:174:24: warning:
variable 'et' set but not used [-Wunused-but-set-variable]It not used any more after commit
a1ce35fa4985 ("block: remove dead elevator code")Signed-off-by: YueHaibing
Signed-off-by: Jens Axboe
08 Nov, 2018
2 commits
-
This is a remnant of when we had ops for both SQ and MQ
schedulers. Now it's just MQ, so get rid of the union.Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe -
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.Tested-by: Ming Lei
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
09 Jul, 2018
1 commit
-
The flag GFP_ATOMIC already contains __GFP_HIGH. There is no need to
explicitly or __GFP_HIGH again. So, just remove unnecessary __GFP_HIGH.Signed-off-by: Shakeel Butt
Signed-off-by: Jens Axboe
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
04 Mar, 2017
1 commit
-
Pull block layer fixes from Jens Axboe:
"A collection of fixes for this merge window, either fixes for existing
issues, or parts that were waiting for acks to come in. This pull
request contains:- Allocation of nvme queues on the right node from Shaohua.
This was ready long before the merge window, but waiting on an ack
from Bjorn on the PCI bit. Now that we have that, the three patches
can go in.- Two fixes for blk-mq-sched with nvmeof, which uses hctx specific
request allocations. This caused an oops. One part from Sagi, one
part from Omar.- A loop partition scan deadlock fix from Omar, fixing a regression
in this merge window.- A three-patch series from Keith, closing up a hole on clearing out
requests on shutdown/resume.- A stable fix for nbd from Josef, fixing a leak of sockets.
- Two fixes for a regression in this window from Jan, fixing a
problem with one of his earlier patches dealing with queue vs bdi
life times.- A fix for a regression with virtio-blk, causing an IO stall if
scheduling is used. From me.- A fix for an io context lock ordering problem. From me"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Move bdi_unregister() to del_gendisk()
blk-mq: ensure that bd->last is always set correctly
block: don't call ioc_exit_icq() with the queue lock held for blk-mq
block: Initialize bd_bdi on inode initialization
loop: fix LO_FLAGS_PARTSCAN hang
nvme: Complete all stuck requests
blk-mq: Provide freeze queue timeout
blk-mq: Export blk_mq_freeze_queue_wait
nbd: stop leaking sockets
blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
blk-mq: kill blk_mq_set_alloc_data()
blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
nvme: allocate nvme_queue in correct node
PCI: add an API to get node from vector
blk-mq: allocate blk_mq_tags and requests in correct node
03 Mar, 2017
1 commit
-
For legacy scheduling, we always call ioc_exit_icq() with both the
ioc and queue lock held. This poses a problem for blk-mq with
scheduling, since the queue lock isn't what we use in the scheduler.
And since we don't need the queue lock held for ioc exit there,
don't grab it and leave any extra locking up to the blk-mq scheduler.Reported-by: Paolo Valente
Tested-by: Paolo Valente
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
02 Mar, 2017
1 commit
-
But first update the code that uses these facilities with the
new header.Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
11 Feb, 2017
1 commit
-
None of the other blk-mq elevator hooks are called with this lock held.
Additionally, it can lead to circular locking dependencies between
queue_lock and the private scheduler lock.Reported-by: Paolo Valente
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
18 Jan, 2017
2 commits
-
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.Signed-off-by: Jens Axboe
Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval -
Prep patch for adding MQ ops as well, since doing anon unions with
named initializers doesn't work on older compilers.Signed-off-by: Jens Axboe
Reviewed-by: Johannes Thumshirn
Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
07 Nov, 2015
1 commit
-
…d avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
19 Feb, 2014
1 commit
-
(Trivial patch.)
If the code is looking at the RCU-protected pointer itself, but not
dereferencing it, the rcu_dereference() functions can be downgraded to
rcu_access_pointer(). This commit makes this downgrade in blkg_destroy()
and ioc_destroy_icq(), both of which simply compare the RCU-protected
pointer against another pointer with no dereferencing.Signed-off-by: Paul E. McKenney
Cc: Jens Axboe
Signed-off-by: Jens Axboe
09 Nov, 2013
1 commit
-
Cc: Yinghai Lu
Cc: Tejun Heo
Cc: Andrew MortonSigned-off-by: Grygorii Strashko
Signed-off-by: Santosh Shilimkar
Signed-off-by: Jens Axboe
12 Sep, 2013
1 commit
-
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt. Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes. However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed. To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.Signed-off-by: Jan Kara
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 May, 2013
1 commit
-
Block layer uses workqueues for multiple purposes. There is no real dependency
of scheduling these on the cpu which scheduled them.On a idle system, it is observed that and idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.This patch replaces normal workqueues with power efficient versions.
Cc: Jens Axboe
Signed-off-by: Viresh Kumar
Signed-off-by: Tejun Heo
28 Feb, 2013
1 commit
-
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;type T;
expression a,c,d,e;
identifier b;
statement S;
@@-T b;
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin
Acked-by: Paul E. McKenney
Signed-off-by: Sasha Levin
Cc: Wu Fengguang
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Aug, 2012
1 commit
-
Hi,
I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
below warning as of 3.5-rc1 and later (3.4 is fine):[ 10.886893] ------------[ cut here ]------------
[ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
[ 10.886905] Hardware name: Bochs
[ 10.886906] Modules linked in:
[ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
[ 10.886908] Call Trace:
[ 10.886911] [] warn_slowpath_common+0x7a/0xb0
[ 10.886912] [] warn_slowpath_null+0x15/0x20
[ 10.886913] [] copy_process+0x1488/0x1560
[ 10.886914] [] do_fork+0xb4/0x340
[ 10.886918] [] ? recalc_sigpending+0x1a/0x50
[ 10.886919] [] ? __set_task_blocked+0x32/0x80
[ 10.886920] [] ? __set_current_blocked+0x3a/0x60
[ 10.886923] [] sys_clone+0x23/0x30
[ 10.886925] [] stub_clone+0x13/0x20
[ 10.886927] [] ? system_call_fastpath+0x16/0x1b
[ 10.886928] ---[ end trace 32a14af7ee6a590b ]---Reproducing is easy, I can hit it on a KVM system with a very basic
config (x86_64 make defconfig + enable the drivers needed). To hit it,
just install dump (on debian/ubuntu, not sure what the package might be
called on Fedora), and:dump -o -f /tmp/foo /
You'll see the warning in dmesg once it forks off the I/O process and
starts dumping filesystem contents.I bisected it down to the following commit:
commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
Author: Tejun Heo
Date: Mon Mar 5 13:15:26 2012 -0800block: add io_context->active_ref
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks. This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.This will be used to associate bio's with a given task. This patch
doesn't introduce any visible behavior change.Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens AxboeIt seems like the init of ioc->nr_tasks was removed in that patch,
so it starts out at 0 instead of 1.Tejun, is the right thing here to add back the init, or should something else
be done?The below patch removes the warning, but I haven't done any more extensive
testing on it.Signed-off-by: Olof Johansson
Acked-by: Tejun Heo
Cc: stable@kernel.org
Signed-off-by: Jens Axboe
31 May, 2012
1 commit
-
Calling get_task_io_context() on a exiting task which isn't %current can
loop forever. This triggers at boot time on my dev machine.BUG: soft lockup - CPU#3 stuck for 22s ! [mountall.1603]
Fix this by making create_task_io_context() returns -EBUSY in this case
to break the loop.Signed-off-by: Eric Dumazet
Cc: Tejun Heo
Cc: Andrew Morton
Cc: Alan Cox
Signed-off-by: Jens Axboe
02 Apr, 2012
1 commit
-
cgroup/for-3.5 contains the following changes which blk-cgroup needs
to proceed with the on-going cleanup.* Dynamic addition and removal of cftypes to make config/stat file
handling modular for policies.* cgroup removal update to not wait for css references to drain to fix
blkcg removal hang caused by cfq caching cfqgs.Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the
following conflicts in block/blk-cgroup.c.* 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
removal. Resolved by removing @subsys from all subsys methods.* 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
controllers" conflicts with ->pre_destroy() and ->attach() updates
and removal of modular config. Resolved by dropping forward
declarations of the methods and applying updates to the relocated
blkio_subsys.* 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
cftype interface" builds upon the previous item. Resolved by adding
->base_cftypes to the relocated blkio_subsys.Signed-off-by: Tejun Heo
20 Mar, 2012
1 commit
-
After the previous patch to cfq, there's no ioc_get_changed() user
left. This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all
related stuff.Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe
14 Mar, 2012
1 commit
-
When put_io_context is called, if ioc->icq_list is empty and refcount
is 1, kernel will not free the ioc.This is caught by following kmemleak:
unreferenced object 0xffff880036349fe0 (size 216):
comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
backtrace:
[] kmemleak_alloc+0x26/0x50
[] kmem_cache_alloc_node+0x1cc/0x2a0
[] create_io_context_slowpath+0x27/0x130
[] get_task_io_context+0xbb/0xf0
[] copy_process+0x188e/0x18b0
[] do_fork+0x11b/0x420
[] sys_clone+0x28/0x30
[] stub_clone+0x13/0x20
[] 0xffffffffffffffffioc should be freed if ioc->icq_list is empty.
Signed-off-by: Xiaotian Feng
Acked-by: Vivek Goyal
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
07 Mar, 2012
2 commits
-
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks. This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.This will be used to associate bio's with a given task. This patch
doesn't introduce any visible behavior change.Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe -
Make the following interface updates to prepare for future ioc related
changes.* create_io_context() returning ioc only works for %current because it
doesn't increment ref on the ioc. Drop @task parameter from it and
always assume %current.* Make create_io_context_slowpath() return 0 or -errno and rename it
to create_task_io_context().* Make ioc_create_icq() take @ioc as parameter instead of assuming
that of %current. The caller, get_request(), is updated to create
ioc explicitly and then pass it into ioc_create_icq().Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe
15 Feb, 2012
3 commits
-
While updating locking, b2efa05265 "block, cfq: unlink
cfq_io_context's immediately" moved elevator_exit_icq_fn() invocation
from exit_io_context() to the final ioc put. While this doesn't cause
catastrophic failure, it effectively removes task exit notification to
elevator and cause noticeable IO performance degradation with CFQ.On task exit, CFQ used to immediately expire the slice if it was being
used by the exiting task as no more IO would be issued by the task;
however, after b2efa05265, the notification is lost and disk could sit
idle needlessly, leading to noticeable IO performance degradation for
certain workloads.This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates
elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it
from exit_io_context(). ICQ_EXITED flag is added to avoid invoking
the callback more than once for the same icq.Walking icq_list from ioc side and invoking elevator callback requires
reverse double locking. This may be better implemented using RCU;
unfortunately, using RCU isn't trivial. e.g. RCU protection would
need to cover request_queue and queue_lock switch on cleanup makes
grabbing queue_lock from RCU unsafe. Reverse double locking should
do, at least for now.Signed-off-by: Tejun Heo
Reported-and-bisected-by: Shaohua Li
LKML-Reference:
Tested-by: Shaohua Li
Signed-off-by: Jens Axboe -
Reverse double lock dancing in ioc_release_fn() can be simplified by
just using trylock on the queue_lock and back out from ioc lock on
trylock failure. Simplify it.Signed-off-by: Tejun Heo
Tested-by: Shaohua Li
Signed-off-by: Jens Axboe -
icq->changed was used for ICQ_*_CHANGED bits. Rename it to flags and
access it under ioc->lock instead of using atomic bitops.
ioc_get_changed() is added so that the changed part can be fetched and
cleared as before.icq->flags will be used to carry other flags.
Signed-off-by: Tejun Heo
Tested-by: Shaohua Li
Signed-off-by: Jens Axboe
11 Feb, 2012
1 commit
-
11a3122f6c "block: strip out locking optimization in put_io_context()"
removed ioc_lock depth lockdep annoation along with locking
optimization; however, while recursing from put_io_context() is no
longer possible, ioc_release_fn() may still end up putting the last
reference of another ioc through elevator, which wlil grab ioc->lock
triggering spurious (as the ioc is always different one) A-A deadlock
warning.As this can only happen one time from ioc_release_fn(), using non-zero
subclass from ioc_release_fn() is enough. Use subclass 1.Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe
07 Feb, 2012
1 commit
-
put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue. It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization. Strip it out. If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.Signed-off-by: Tejun Heo
LKML-Reference:
Signed-off-by: Jens Axboe
06 Feb, 2012
1 commit
-
Meelis reported a warning:
WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
Hardware name: 939Dual-SATA2
timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd125666 #176
Call Trace:
[] warn_slowpath_common+0x7e/0x96
[] ? cfq_slice_expired+0x1d/0x1d
[] warn_slowpath_fmt+0x41/0x43
[] ? cfq_idle_slice_timer+0xa1/0xaa
[] ? cfq_slice_expired+0x1d/0x1d
[] run_timer_softirq+0x199/0x1ec
[] ? timekeeping_get_ns+0x12/0x31
[] ? apic_write+0x11/0x13
[] __do_softirq+0x74/0xfa
[] call_softirq+0x1a/0x30
[] do_softirq+0x31/0x68
[] irq_exit+0x3d/0xa3
[] smp_apic_timer_interrupt+0x6b/0x77
[] apic_timer_interrupt+0x69/0x70
[] ? sched_clock_cpu+0x73/0x7d
[] ? sched_clock_cpu+0x73/0x7d
[] ? default_idle+0x1e/0x32
[] ? default_idle+0x18/0x32
[] cpu_idle+0x87/0xd1
[] rest_init+0x85/0x89
[] start_kernel+0x2eb/0x2f8
[] x86_64_start_reservations+0x7e/0x82
[] x86_64_start_kernel+0xf0/0xf7this_q == locked_q is possible. There are two problems here:
1. In UP case, there is preemption counter issue as spin_trylock always
successes.
2. In SMP case, the loop breaks too earlier.Signed-off-by: Shaohua Li
Reported-by: Meelis Roos
Reported-by: Knut Petersen
Tested-by: Knut Petersen
Signed-off-by: Jens Axboe
28 Dec, 2011
1 commit
-
6e736be7 "block: make ioc get/put interface more conventional and fix
race on alloction" added WARN_ON_ONCE() in exit_io_context() which
triggers if !PF_EXITING. All tasks hitting exit_io_context() from
task exit should have PF_EXITING set but task struct tearing down
after fork failure calls into the function without PF_EXITING,
triggering the condition.WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77
Call Trace:
[] warn_slowpath_common+0x8f/0xb2
[] warn_slowpath_null+0x18/0x1a
[] exit_io_context+0x40/0x92
[] copy_process+0x126f/0x1453
[] do_fork+0x120/0x3e9
[] sys_clone+0x26/0x28
[] stub_clone+0x13/0x20
---[ end trace a2e4eb670b375238 ]---Reported-by: Sasha Levin
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe
25 Dec, 2011
1 commit
-
While fixing io_context creation / task exit race condition,
6e736be7f2 "block: make ioc get/put interface more conventional and
fix race on alloction" also prevented an exiting (%PF_EXITING) task
from creating its own io_context. This is incorrect as exit path may
issue IOs, e.g. from exit_files(), and if those IOs are the first ones
issued by the task, io_context needs to be created to process the IOs.Combined with the existing problem of io_context / io_cq creation
failure having the possibility of stalling IO, this problem results in
deterministic full IO lockup with certain workloads.Fix it by allowing io_context creation regardless of %PF_EXITING for
%current.Signed-off-by: Tejun Heo
Reported-by: Andrew Morton
Reported-by: Hugh Dickins
Signed-off-by: Jens Axboe
19 Dec, 2011
1 commit
-
With the ioc changed, ioc_cgroup_changed() can be used by modular
code. So ensure that it is exported.Reported-by: Stephen Rothwell
Signed-off-by: Jens Axboe
14 Dec, 2011
5 commits
-
Now block layer knows everything necessary to create and associate
icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
get_request() such that, if elevator_type->icq_size is set, requests
are automatically associated with their matching icq's before
elv_set_request(). io_context reference is also managed by block core
on request alloc/free.* Only ioprio/cgroup changed handling remains from cfq_get_cic().
Collapsed into cfq_set_request().* This removes queue kicking on icq allocation failure (for now). As
icq allocation failure is rare and the only effect of queue kicking
achieved was possibily accelerating queue processing, this change
shouldn't be noticeable.There is a larger underlying problem. Unlike request allocation,
icq allocation is not guaranteed to succeed eventually after
retries. The number of icq is unbound and thus mempool can't be the
solution either. This effectively adds allocation dependency on
memory free path and thus possibility of deadlock.This usually wouldn't happen because icq allocation is not a hot
path and, even when the condition triggers, it's highly unlikely
that none of the writeback workers already has icq.However, this is still possible especially if elevator is being
switched under high memory pressure, so we better get it fixed.
Probably the only solution is just bypassing elevator and appending
to dispatch queue on any elevator allocation failure.* Comment added to explain how icq's are managed and synchronized.
This completes cleanup of io_context interface.
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe -
With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc. The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released. New field
io_cq->__rcu_icq_cache is added for this purpose. As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe -
Now that all io_cq related data structures are in block core layer,
io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
parameter return type changes (cfqd -> request_queue, cfq_io_cq ->
io_cq) and cfq_cic_lookup() becomes thin wrapper around
cfq_cic_lookup().Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe -
Currently io_context and cfq logics are mixed without clear boundary.
Most of io_context is independent from cfq but cfq_io_context handling
logic is dispersed between generic ioc code and cfq.cfq_io_context represents association between an io_context and a
request_queue, which is a concept useful outside of cfq, but it also
contains fields which are useful only to cfq.This patch takes out generic part and put it into io_cq (io
context-queue) and the rest into cfq_io_cq (cic moniker remains the
same) which contains io_cq. The following changes are made together.* cfq_ttime and cfq_io_cq now live in cfq-iosched.c.
* All related fields, functions and constants are renamed accordingly.
* ioc->ioc_data is now "struct io_cq *" instead of "void *" and
renamed to icq_hint.This prepares for io_context API cleanup. Documentation is currently
sparse. It will be added later.Changes in this patch are mechanical and don't cause functional
change.Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe -
When called under queue_lock, current_io_context() triggers lockdep
warning if it hits allocation path. This is because io_context
installation is protected by task_lock which is not IRQ safe, so it
triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
deadlock warning.Given the restriction, accessor + creator rolled into one doesn't work
too well. Drop current_io_context() and let the users access
task->io_context directly inside queue_lock combined with explicit
creation using create_io_context().Future ioc updates will further consolidate ioc access and the create
interface will be unexported.While at it, relocate ioc internal interface declarations in blk.h and
add section comments before and after.This patch does not introduce functional change.
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe