17 Jul, 2020

2 commits

  • The reverse-order double lock dance in ioc_release_fn() uses a
    retry loop. This is a problem on PREEMPT_RT because the task that
    would release q->queue_lock can be preempted, causing the retry
    loop to livelock.

    RCU is already managing the freeing of the request queue and icq. If
    the trylock fails, use RCU to guarantee that the request queue and
    icq are not freed and re-acquire the locks in the correct order,
    allowing forward progress.

    Signed-off-by: John Ogness
    Reviewed-by: Daniel Wagner
    Signed-off-by: Jens Axboe

    John Ogness
     
  • The legacy CFQ IO scheduler could call put_io_context() in its exit_icq()
    elevator callback. This led to a lockdep warning, which was fixed in
    commit d8c66c5d5924 ("block: fix lockdep warning on io_context release
    put_io_context()") by using a nested subclass for the ioc spinlock.
    However, with commit f382fb0bcef4 ("block: remove legacy IO schedulers")
    the CFQ IO scheduler no longer exists.

    The BFQ IO scheduler also implements the exit_icq() elevator callback but
    does not call put_io_context().

    The nested subclass for the ioc spinlock is no longer needed. Since it
    existed as an exception and no longer applies, remove the nested subclass
    usage.
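Concretely, the change is presumably the removal of the subclass-1 annotation in ioc_release_fn() (a reconstructed sketch; the exact call site and flags variable may differ):

```
/* Before: subclass 1 avoided a false A-A lockdep report when CFQ's
 * exit_icq() recursed into another ioc's lock via put_io_context(). */
spin_lock_irqsave_nested(&ioc->lock, flags, 1);

/* After: plain locking suffices, since no exit_icq() implementation
 * calls put_io_context() any more. */
spin_lock_irqsave(&ioc->lock, flags);
```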

    Signed-off-by: John Ogness
    Reviewed-by: Daniel Wagner
    Signed-off-by: Jens Axboe

    John Ogness
     

12 Mar, 2020

1 commit

  • There is a potential race between ioc_release_fn() and
    ioc_clear_queue(), shown below, which leads to the kernel crash
    observed. It can also result in a use-after-free.

    context#1: ioc_release_fn()              context#2: __ioc_clear_queue()
                                             (gets the same icq)
      ->spin_lock(&ioc->lock);                 ->spin_lock(&ioc->lock);
      ->ioc_destroy_icq(icq);
        ->list_del_init(&icq->q_node);
        ->call_rcu(&icq->__rcu_head,
                   icq_free_icq_rcu);
      ->spin_unlock(&ioc->lock);
                                               ->ioc_destroy_icq(icq);
                                                 ->hlist_del_init(&icq->ioc_node);

    The hlist_del_init() in context#2 results in the crash below, as that
    memory is now used by icq->__rcu_head in context#1. There is also a
    chance that the icq itself has already been freed.

    22150.386550: Unable to handle kernel write to read-only memory
    at virtual address ffffffaa8d31ca50
    ...
    Call trace:
    22150.607350: ioc_destroy_icq+0x44/0x110
    22150.611202: ioc_clear_queue+0xac/0x148
    22150.615056: blk_cleanup_queue+0x11c/0x1a0
    22150.619174: __scsi_remove_device+0xdc/0x128
    22150.623465: scsi_forget_host+0x2c/0x78
    22150.627315: scsi_remove_host+0x7c/0x2a0
    22150.631257: usb_stor_disconnect+0x74/0xc8
    22150.635371: usb_unbind_interface+0xc8/0x278
    22150.639665: device_release_driver_internal+0x198/0x250
    22150.644897: device_release_driver+0x24/0x30
    22150.649176: bus_remove_device+0xec/0x140
    22150.653204: device_del+0x270/0x460
    22150.656712: usb_disable_device+0x120/0x390
    22150.660918: usb_disconnect+0xf4/0x2e0
    22150.664684: hub_event+0xd70/0x17e8
    22150.668197: process_one_work+0x210/0x480
    22150.672222: worker_thread+0x32c/0x4c8

    Fix this by adding a new ICQ_DESTROYED flag, set in ioc_destroy_icq(),
    to mark an icq that has already been destroyed. Also, make
    __ioc_clear_queue() access the icq inside rcu_read_lock()/rcu_read_unlock()
    so that the icq is not freed while it is still in use.

    Signed-off-by: Sahitya Tummala
    Co-developed-by: Pradeep P V K
    Signed-off-by: Pradeep P V K
    Signed-off-by: Jens Axboe

    Sahitya Tummala
     

16 Nov, 2018

2 commits


10 Nov, 2018

1 commit

  • Fixes gcc '-Wunused-but-set-variable' warning:

    block/blk-ioc.c: In function 'put_io_context_active':
    block/blk-ioc.c:174:24: warning:
    variable 'et' set but not used [-Wunused-but-set-variable]

    The variable is not used any more after commit
    a1ce35fa4985 ("block: remove dead elevator code").

    Signed-off-by: YueHaibing
    Signed-off-by: Jens Axboe

    YueHaibing
     

08 Nov, 2018

2 commits

  • This is a remnant of when we had ops for both SQ and MQ
    schedulers. Now it's just MQ, so get rid of the union.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This removes a bunch of core and elevator related code. On the core
    front, we remove anything related to queue running, draining,
    initialization, plugging, and congestions. We also kill anything
    related to request allocation, merging, retrieval, and completion.

    Remove any checking for single queue IO schedulers, as they no
    longer exist. This means we can also delete a bunch of code related
    to request issue, adding, completion, etc - and all the SQ related
    ops and helpers.

    Also kill the load_default_modules(), as all that did was provide
    for a way to load the default single queue elevator.

    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Jul, 2018

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
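As an example, the identifier is a single comment line at the top of the file (placement and comment-style conventions vary by file type; this is the form for a C source file):

```
// SPDX-License-Identifier: GPL-2.0
```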

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Mar, 2017

1 commit

  • Pull block layer fixes from Jens Axboe:
    "A collection of fixes for this merge window, either fixes for existing
    issues, or parts that were waiting for acks to come in. This pull
    request contains:

    - Allocation of nvme queues on the right node from Shaohua.

    This was ready long before the merge window, but waiting on an ack
    from Bjorn on the PCI bit. Now that we have that, the three patches
    can go in.

    - Two fixes for blk-mq-sched with nvmeof, which uses hctx specific
    request allocations. This caused an oops. One part from Sagi, one
    part from Omar.

    - A loop partition scan deadlock fix from Omar, fixing a regression
    in this merge window.

    - A three-patch series from Keith, closing up a hole on clearing out
    requests on shutdown/resume.

    - A stable fix for nbd from Josef, fixing a leak of sockets.

    - Two fixes for a regression in this window from Jan, fixing a
    problem with one of his earlier patches dealing with queue vs bdi
    life times.

    - A fix for a regression with virtio-blk, causing an IO stall if
    scheduling is used. From me.

    - A fix for an io context lock ordering problem. From me"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: Move bdi_unregister() to del_gendisk()
    blk-mq: ensure that bd->last is always set correctly
    block: don't call ioc_exit_icq() with the queue lock held for blk-mq
    block: Initialize bd_bdi on inode initialization
    loop: fix LO_FLAGS_PARTSCAN hang
    nvme: Complete all stuck requests
    blk-mq: Provide freeze queue timeout
    blk-mq: Export blk_mq_freeze_queue_wait
    nbd: stop leaking sockets
    blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
    blk-mq: kill blk_mq_set_alloc_data()
    blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
    blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
    nvme: allocate nvme_queue in correct node
    PCI: add an API to get node from vector
    blk-mq: allocate blk_mq_tags and requests in correct node

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • For legacy scheduling, we always call ioc_exit_icq() with both the
    ioc and queue lock held. This poses a problem for blk-mq with
    scheduling, since the queue lock isn't what we use in the scheduler.
    And since we don't need the queue lock held for ioc exit there,
    don't grab it and leave any extra locking up to the blk-mq scheduler.

    Reported-by: Paolo Valente
    Tested-by: Paolo Valente
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Mar, 2017

1 commit


11 Feb, 2017

1 commit


18 Jan, 2017

2 commits

  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interface, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     
  • Prep patch for adding MQ ops as well, since doing anon unions with
    named initializers doesn't work on older compilers.

    Signed-off-by: Jens Axboe
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep, and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

19 Feb, 2014

1 commit

  • (Trivial patch.)

    If the code is looking at the RCU-protected pointer itself, but not
    dereferencing it, the rcu_dereference() functions can be downgraded to
    rcu_access_pointer(). This commit makes this downgrade in blkg_destroy()
    and ioc_destroy_icq(), both of which simply compare the RCU-protected
    pointer against another pointer with no dereferencing.

    Signed-off-by: Paul E. McKenney
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Paul E. McKenney
     

09 Nov, 2013

1 commit


12 Sep, 2013

1 commit

  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    process context:                     interrupt context:
    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];
          <interrupt>
                                         ...
                                         radix_tree_preload()
                                         ...
                                         radix_tree_insert()
                                           radix_tree_node_alloc()
                                             if (rtp->nr) {
                                               ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption with different results (usually OOPS) depending on which
    two users of radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    in_interrupt() check is somewhat ugly but we cannot simply key off passed
    gfp_mask as that is acquired from root_gfp_mask() and thus the same for
    all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
    preallocation in such case doesn't make sense and when preallocation would
    happen in interrupt we could possibly leak some allocated nodes. However,
    some users of radix_tree_preload() require following radix_tree_insert()
    to succeed. To avoid unexpected effects for these users,
    radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
    and we provide a new function radix_tree_maybe_preload() for those users
    which get different gfp mask from different call sites and which are
    prepared to handle radix_tree_insert() failure.

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

15 May, 2013

1 commit

  • The block layer uses workqueues for multiple purposes. There is no real
    dependency on scheduling these on the cpu which scheduled them.

    On an idle system, it is observed that an idle cpu wakes up many times just
    to service this work. It would be better if we could schedule it on a cpu
    which the scheduler believes to be the most appropriate one.

    This patch replaces normal workqueues with power efficient versions.

    Cc: Jens Axboe
    Signed-off-by: Viresh Kumar
    Signed-off-by: Tejun Heo

    Viresh Kumar
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

01 Aug, 2012

1 commit

  • Hi,

    I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
    below warning as of 3.5-rc1 and later (3.4 is fine):

    [ 10.886893] ------------[ cut here ]------------
    [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
    [ 10.886905] Hardware name: Bochs
    [ 10.886906] Modules linked in:
    [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
    [ 10.886908] Call Trace:
    [ 10.886911] [] warn_slowpath_common+0x7a/0xb0
    [ 10.886912] [] warn_slowpath_null+0x15/0x20
    [ 10.886913] [] copy_process+0x1488/0x1560
    [ 10.886914] [] do_fork+0xb4/0x340
    [ 10.886918] [] ? recalc_sigpending+0x1a/0x50
    [ 10.886919] [] ? __set_task_blocked+0x32/0x80
    [ 10.886920] [] ? __set_current_blocked+0x3a/0x60
    [ 10.886923] [] sys_clone+0x23/0x30
    [ 10.886925] [] stub_clone+0x13/0x20
    [ 10.886927] [] ? system_call_fastpath+0x16/0x1b
    [ 10.886928] ---[ end trace 32a14af7ee6a590b ]---

    Reproducing is easy, I can hit it on a KVM system with a very basic
    config (x86_64 make defconfig + enable the drivers needed). To hit it,
    just install dump (on debian/ubuntu, not sure what the package might be
    called on Fedora), and:

    dump -o -f /tmp/foo /

    You'll see the warning in dmesg once it forks off the I/O process and
    starts dumping filesystem contents.

    I bisected it down to the following commit:

    commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
    Author: Tejun Heo
    Date: Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks. This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task. This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    It seems like the init of ioc->nr_tasks was removed in that patch,
    so it starts out at 0 instead of 1.

    Tejun, is the right thing here to add back the init, or should something else
    be done?

    The below patch removes the warning, but I haven't done any more extensive
    testing on it.

    Signed-off-by: Olof Johansson
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Olof Johansson
     

31 May, 2012

1 commit

  • Calling get_task_io_context() on an exiting task which isn't %current can
    loop forever. This triggers at boot time on my dev machine.

    BUG: soft lockup - CPU#3 stuck for 22s ! [mountall.1603]

    Fix this by making create_task_io_context() returns -EBUSY in this case
    to break the loop.

    Signed-off-by: Eric Dumazet
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Alan Cox
    Signed-off-by: Jens Axboe

    Eric Dumazet
     

02 Apr, 2012

1 commit

  • cgroup/for-3.5 contains the following changes which blk-cgroup needs
    to proceed with the on-going cleanup.

    * Dynamic addition and removal of cftypes to make config/stat file
    handling modular for policies.

    * cgroup removal update to not wait for css references to drain to fix
    blkcg removal hang caused by cfq caching cfqgs.

    Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the
    following conflicts in block/blk-cgroup.c.

    * 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
    conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
    removal. Resolved by removing @subsys from all subsys methods.

    * 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
    controllers" conflicts with ->pre_destroy() and ->attach() updates
    and removal of modular config. Resolved by dropping forward
    declarations of the methods and applying updates to the relocated
    blkio_subsys.

    * 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
    cftype interface" builds upon the previous item. Resolved by adding
    ->base_cftypes to the relocated blkio_subsys.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

20 Mar, 2012

1 commit

  • After the previous patch to cfq, there's no ioc_get_changed() user
    left. This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all
    related stuff.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Mar, 2012

1 commit

  • When put_io_context() is called, if ioc->icq_list is empty and the
    refcount is 1, the kernel will not free the ioc.

    This is caught by following kmemleak:

    unreferenced object 0xffff880036349fe0 (size 216):
    comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
    backtrace:
    [] kmemleak_alloc+0x26/0x50
    [] kmem_cache_alloc_node+0x1cc/0x2a0
    [] create_io_context_slowpath+0x27/0x130
    [] get_task_io_context+0xbb/0xf0
    [] copy_process+0x188e/0x18b0
    [] do_fork+0x11b/0x420
    [] sys_clone+0x28/0x30
    [] stub_clone+0x13/0x20
    [] 0xffffffffffffffff

    ioc should be freed if ioc->icq_list is empty.
    Signed-off-by: Xiaotian Feng
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Xiaotian Feng
     

07 Mar, 2012

2 commits

  • Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks. This
    patch separates out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task. This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Make the following interface updates to prepare for future ioc related
    changes.

    * create_io_context() returning ioc only works for %current because it
    doesn't increment ref on the ioc. Drop @task parameter from it and
    always assume %current.

    * Make create_io_context_slowpath() return 0 or -errno and rename it
    to create_task_io_context().

    * Make ioc_create_icq() take @ioc as parameter instead of assuming
    that of %current. The caller, get_request(), is updated to create
    ioc explicitly and then pass it into ioc_create_icq().
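The third change presumably amounts to a signature update along these lines (reconstructed for illustration, not taken verbatim from the patch):

```
/* before: implicitly used current->io_context */
static struct io_cq *ioc_create_icq(struct request_queue *q, gfp_t gfp_mask);

/* after: the caller supplies the ioc it has already created/referenced */
static struct io_cq *ioc_create_icq(struct io_context *ioc,
                                    struct request_queue *q, gfp_t gfp_mask);
```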

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Feb, 2012

3 commits

  • While updating locking, b2efa05265 "block, cfq: unlink
    cfq_io_context's immediately" moved the elevator_exit_icq_fn() invocation
    from exit_io_context() to the final ioc put. While this doesn't cause
    catastrophic failure, it effectively removes task exit notification to
    the elevator and causes noticeable IO performance degradation with CFQ.
    On task exit, CFQ used to immediately expire the slice if it was being
    used by the exiting task as no more IO would be issued by the task;
    however, after b2efa05265, the notification is lost and disk could sit
    idle needlessly, leading to noticeable IO performance degradation for
    certain workloads.

    This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates
    elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it
    from exit_io_context(). ICQ_EXITED flag is added to avoid invoking
    the callback more than once for the same icq.

    Walking icq_list from the ioc side and invoking the elevator callback
    requires reverse double locking. This may be better implemented using
    RCU; unfortunately, using RCU isn't trivial: e.g. RCU protection would
    need to cover the request_queue, and the queue_lock switch on cleanup
    makes grabbing queue_lock under RCU unsafe. Reverse double locking
    will do, at least for now.

    Signed-off-by: Tejun Heo
    Reported-and-bisected-by: Shaohua Li
    LKML-Reference:
    Tested-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The reverse double lock dancing in ioc_release_fn() can be simplified by
    just using trylock on the queue_lock and backing out from the ioc lock on
    trylock failure. Simplify it.

    Signed-off-by: Tejun Heo
    Tested-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • icq->changed was used for ICQ_*_CHANGED bits. Rename it to flags and
    access it under ioc->lock instead of using atomic bitops.
    ioc_get_changed() is added so that the changed part can be fetched and
    cleared as before.

    icq->flags will be used to carry other flags.

    Signed-off-by: Tejun Heo
    Tested-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Feb, 2012

1 commit

  • 11a3122f6c "block: strip out locking optimization in put_io_context()"
    removed the ioc_lock depth lockdep annotation along with the locking
    optimization; however, while recursing from put_io_context() is no
    longer possible, ioc_release_fn() may still end up putting the last
    reference of another ioc through the elevator, which will grab ioc->lock,
    triggering a spurious (the ioc is always a different one) A-A deadlock
    warning.

    As this can only happen one time from ioc_release_fn(), using non-zero
    subclass from ioc_release_fn() is enough. Use subclass 1.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Feb, 2012

1 commit

  • put_io_context() performed a complex trylock dance to avoid
    deferring ioc release to a workqueue. It was also broken on UP because
    the trylock was always assumed to succeed, which resulted in an
    unbalanced preemption count.

    While there are ways to fix the UP breakage, even the most
    pathological microbench (forced ioc allocation and tight fork/exit
    loop) fails to show any appreciable performance benefit of the
    optimization. Strip it out. If there turns out to be workloads which
    are affected by this change, simpler optimization from the discussion
    thread can be applied later.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 Feb, 2012

1 commit

  • Meelis reported a warning:

    WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
    Hardware name: 939Dual-SATA2
    timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
    Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
    Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd125666 #176
    Call Trace:
    [] warn_slowpath_common+0x7e/0x96
    [] ? cfq_slice_expired+0x1d/0x1d
    [] warn_slowpath_fmt+0x41/0x43
    [] ? cfq_idle_slice_timer+0xa1/0xaa
    [] ? cfq_slice_expired+0x1d/0x1d
    [] run_timer_softirq+0x199/0x1ec
    [] ? timekeeping_get_ns+0x12/0x31
    [] ? apic_write+0x11/0x13
    [] __do_softirq+0x74/0xfa
    [] call_softirq+0x1a/0x30
    [] do_softirq+0x31/0x68
    [] irq_exit+0x3d/0xa3
    [] smp_apic_timer_interrupt+0x6b/0x77
    [] apic_timer_interrupt+0x69/0x70
    [] ? sched_clock_cpu+0x73/0x7d
    [] ? sched_clock_cpu+0x73/0x7d
    [] ? default_idle+0x1e/0x32
    [] ? default_idle+0x18/0x32
    [] cpu_idle+0x87/0xd1
    [] rest_init+0x85/0x89
    [] start_kernel+0x2eb/0x2f8
    [] x86_64_start_reservations+0x7e/0x82
    [] x86_64_start_kernel+0xf0/0xf7

    this_q == locked_q is possible. There are two problems here:
    1. In the UP case, there is a preemption counter issue, as
    spin_trylock() always succeeds.
    2. In the SMP case, the loop breaks too early.

    Signed-off-by: Shaohua Li
    Reported-by: Meelis Roos
    Reported-by: Knut Petersen
    Tested-by: Knut Petersen
    Signed-off-by: Jens Axboe

    Shaohua Li
     

28 Dec, 2011

1 commit

  • 6e736be7 "block: make ioc get/put interface more conventional and fix
    race on alloction" added a WARN_ON_ONCE() in exit_io_context() which
    triggers if !PF_EXITING. All tasks hitting exit_io_context() from
    task exit should have PF_EXITING set, but task struct teardown
    after fork failure calls into the function without PF_EXITING,
    triggering the condition.

    WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
    Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77
    Call Trace:
    [] warn_slowpath_common+0x8f/0xb2
    [] warn_slowpath_null+0x18/0x1a
    [] exit_io_context+0x40/0x92
    [] copy_process+0x126f/0x1453
    [] do_fork+0x120/0x3e9
    [] sys_clone+0x26/0x28
    [] stub_clone+0x13/0x20
    ---[ end trace a2e4eb670b375238 ]---
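
    The triggering condition can be modelled in isolation (a hypothetical
    userspace sketch, not the kernel code): PF_EXITING is set only on the
    normal task-exit path, so the fork-failure teardown reaches the check
    with the flag clear and the warning fires.

    ```c
    #include <assert.h>

    /* Hypothetical model of the condition above.  PF_EXITING is set by
     * the normal exit path before exit_io_context() runs, but the
     * fork-failure teardown path never sets it, so a bare
     * WARN_ON_ONCE(!(flags & PF_EXITING)) fires there.  The flag value
     * and names mirror the kernel's only loosely. */

    #define PF_EXITING 0x00000004u

    struct task { unsigned int flags; };

    /* Returns 1 when the warning would fire. */
    static int would_warn(const struct task *t)
    {
        return !(t->flags & PF_EXITING);
    }

    int main(void)
    {
        struct task exiting   = { .flags = PF_EXITING }; /* normal exit  */
        struct task fork_fail = { .flags = 0 };          /* fork failure */

        assert(!would_warn(&exiting));
        assert(would_warn(&fork_fail));
        return 0;
    }
    ```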

    Reported-by: Sasha Levin
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Dec, 2011

1 commit

  • While fixing an io_context creation / task exit race condition,
    6e736be7f2 "block: make ioc get/put interface more conventional and
    fix race on alloction" also prevented an exiting (%PF_EXITING) task
    from creating its own io_context. This is incorrect, as the exit path
    may issue IOs, e.g. from exit_files(), and if those IOs are the first
    ones issued by the task, an io_context needs to be created to process
    them.

    Combined with the existing problem of io_context / io_cq creation
    failure having the possibility of stalling IO, this problem results in
    deterministic full IO lockup with certain workloads.

    Fix it by allowing io_context creation regardless of %PF_EXITING for
    %current.
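
    The relaxed guard can be sketched as follows (a hypothetical
    userspace model with illustrative names and flag value, not the
    block layer's code): creation is refused only when the target task
    is exiting and is not %current.

    ```c
    #include <assert.h>

    /* Hypothetical sketch of the fix described above: io_context
     * creation used to be refused for any PF_EXITING task; it is now
     * refused only when creating on behalf of *another* exiting task,
     * since %current may legitimately issue its first IOs from the exit
     * path.  Names and the flag value are illustrative. */

    #define PF_EXITING 0x00000004u

    struct task { unsigned int flags; };

    /* 1 = creation allowed, 0 = refused */
    static int may_create_ioc(const struct task *task,
                              const struct task *current_task)
    {
        if ((task->flags & PF_EXITING) && task != current_task)
            return 0;   /* another task is exiting: don't create for it */
        return 1;       /* %current may create even while exiting */
    }

    int main(void)
    {
        struct task me    = { .flags = PF_EXITING };
        struct task other = { .flags = PF_EXITING };

        assert(may_create_ioc(&me, &me));      /* exiting %current: ok  */
        assert(!may_create_ioc(&other, &me));  /* exiting other: refused */
        return 0;
    }
    ```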

    Signed-off-by: Tejun Heo
    Reported-by: Andrew Morton
    Reported-by: Hugh Dickins
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 Dec, 2011

1 commit


14 Dec, 2011

2 commits

  • Now block layer knows everything necessary to create and associate
    icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
    get_request() such that, if elevator_type->icq_size is set, requests
    are automatically associated with their matching icq's before
    elv_set_request(). io_context reference is also managed by block core
    on request alloc/free.

    * Only the ioprio/cgroup change handling remains from cfq_get_cic();
    it is collapsed into cfq_set_request().

    * This removes queue kicking on icq allocation failure (for now). As
    icq allocation failure is rare and the only effect queue kicking
    achieved was possibly accelerating queue processing, this change
    shouldn't be noticeable.

    There is a larger underlying problem. Unlike request allocation,
    icq allocation is not guaranteed to succeed eventually after
    retries. The number of icqs is unbounded, so a mempool can't be the
    solution either. This effectively adds an allocation dependency on
    the memory free path and thus the possibility of deadlock.

    This usually wouldn't happen because icq allocation is not a hot
    path and, even when the condition triggers, it's highly unlikely
    that none of the writeback workers already has an icq.

    However, this is still possible especially if elevator is being
    switched under high memory pressure, so we better get it fixed.
    Probably the only solution is just bypassing elevator and appending
    to dispatch queue on any elevator allocation failure.

    * Comment added to explain how icq's are managed and synchronized.

    This completes cleanup of io_context interface.
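
    The allocation-time association described above can be sketched as
    follows (hypothetical types and helper names, not the block layer's
    actual API): when the elevator type declares a non-zero icq_size,
    the core attaches an icq to the request before the elevator's
    set_request hook runs.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical sketch of the get_request() behaviour described
     * above: with elevator_type->icq_size set, the block core looks up
     * or creates the matching icq and attaches it to the request before
     * calling the elevator's set_request hook.  All names here are
     * illustrative stand-ins. */

    struct icq { int dummy; };
    struct elevator_type { size_t icq_size; };
    struct request { struct icq *icq; };

    static struct icq *ioc_lookup_or_create_icq(void)
    {
        /* stand-in for an ioc_lookup_icq()/ioc_create_icq() pair */
        return calloc(1, sizeof(struct icq));
    }

    static int get_request(struct request *rq, const struct elevator_type *et)
    {
        rq->icq = NULL;
        if (et->icq_size) {
            rq->icq = ioc_lookup_or_create_icq();
            if (!rq->icq)
                return -1;   /* allocation failure: fail the request */
        }
        /* elv_set_request(rq) would run here, with rq->icq in place */
        return 0;
    }

    int main(void)
    {
        struct elevator_type with_icq = { .icq_size = sizeof(struct icq) };
        struct elevator_type no_icq = { .icq_size = 0 };
        struct request rq;

        assert(get_request(&rq, &with_icq) == 0 && rq.icq != NULL);
        free(rq.icq);
        assert(get_request(&rq, &no_icq) == 0 && rq.icq == NULL);
        return 0;
    }
    ```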

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the kmem_cache managed by blk-ioc, io_cq exit/release can be
    moved to blk-ioc too. The odd ->io_cq->exit/release() callbacks are
    replaced with elevator_ops->elevator_exit_icq_fn(); unlinking from
    both the ioc and q, and freeing, are handled automatically by
    blk-ioc. The elevator operation only needs to perform the exit work
    specific to the elevator - in cfq's case, exiting the cfqqs.

    Also, clearing of io_cq's on q detach is moved to block core and
    automatically performed on elevator switch and q release.

    Because the q an io_cq points to might be freed before the RCU
    callback for the io_cq runs, the blk-ioc code should remember which
    cache the io_cq needs to be freed back to when the io_cq is
    released. A new field, io_cq->__rcu_icq_cache, is added for this
    purpose. As both the new field and the rcu_head are used only after
    the io_cq is released, while the q_node/ioc_node fields aren't, the
    pairs are put into unions.
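
    The layout can be sketched like this (stub types standing in for the
    kernel's list_head/hlist_node/rcu_head/kmem_cache; field names follow
    the commit, everything else is illustrative): the live-phase linkage
    and the release-phase bookkeeping never overlap in time, so each pair
    can share storage.

    ```c
    #include <assert.h>

    /* Hypothetical sketch of the union layout described above.  The
     * list linkage (q_node/ioc_node) is only used while the io_cq is
     * live; the RCU head and the remembered slab cache are only used
     * after release.  Since the two lifetimes never overlap, each pair
     * shares storage in a union.  The *_stub types stand in for the
     * kernel's real ones. */

    struct list_node { struct list_node *next, *prev; };
    struct rcu_head_stub { void (*func)(void *); };
    struct kmem_cache_stub { const char *name; };

    struct io_cq {
        union {
            struct list_node q_node;              /* live: on q's list   */
            struct kmem_cache_stub *__rcu_icq_cache; /* released: cache  */
        };
        union {
            struct list_node ioc_node;            /* live: on ioc's list */
            struct rcu_head_stub __rcu_head;      /* released: RCU head  */
        };
    };

    int main(void)
    {
        /* The union halves overlap, so the struct is no bigger than the
         * two live-phase list nodes alone. */
        assert(sizeof(struct io_cq) == 2 * sizeof(struct list_node));
        return 0;
    }
    ```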

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo