03 Oct, 2012

2 commits

  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behave like a timer which executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
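
    The merge above centers on the delayed_work rework. A minimal sketch of
    the before/after pattern it enables, with hypothetical driver names
    (my_dev, my_poll_fn, my_kick_*) and assuming the work item was set up
    with INIT_DELAYED_WORK(), might look like this:

        #include <linux/workqueue.h>

        /* Hypothetical driver state; names are illustrative, not from the patches. */
        struct my_dev {
            struct delayed_work poll_work;  /* INIT_DELAYED_WORK(&dev->poll_work, my_poll_fn) at probe time */
        };

        static void my_poll_fn(struct work_struct *work)
        {
            struct my_dev *dev = container_of(to_delayed_work(work),
                                              struct my_dev, poll_work);
            /* ... poll hardware; runs in process context ... */
        }

        /* Old style: cancel + requeue, clunky and easy to get wrong. */
        static void my_kick_old(struct my_dev *dev)
        {
            cancel_delayed_work(&dev->poll_work);
            queue_delayed_work(system_wq, &dev->poll_work, HZ);
        }

        /* New style: mod_delayed_work() behaves like mod_timer() for work items,
         * and is usable even from IRQ context now that delayed_work uses an
         * IRQ-safe timer. */
        static void my_kick_new(struct my_dev *dev)
        {
            mod_delayed_work(system_wq, &dev->poll_work, HZ);
        }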
     

26 Sep, 2012

1 commit

  • In some usage scenarios it is desirable to work with disk images or
    virtualized DASD devices. One problem that prevents such applications
    is the partition detection in ibm.c. Currently it works only for
    devices that support the BIODASDINFO2 ioctl; in other words, it only
    works for devices that belong to the DASD device driver.

    The information gained from the BIODASDINFO2 ioctl is absolutely
    necessary only for a small set of legacy cases. All current VOL1, LNX1 and
    CMS1 type of disk labels can be interpreted correctly without this
    information, as long as the generic HDIO_GETGEO ioctl works and
    provides a correct disk geometry.

    This patch makes the ibm.c partition detection as independent as
    possible from the BIODASDINFO2 ioctl. Only the following two cases are
    still restricted to real DASDs:
    - An FBA DASD, or LDL formatted ECKD DASD without any disk label.
    - An old style LNX1 label (without large volume support) on a disk
    with inconsistent device geometry.

    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky

    Stefan Weinhuber
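
    The change above keys the ibm.c label detection off HDIO_GETGEO instead of
    the DASD-only BIODASDINFO2 ioctl. For reference, a small userspace sketch
    of querying the geometry the same generic way (error handling kept minimal):

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/hdreg.h>

        int main(int argc, char **argv)
        {
            struct hd_geometry geo;
            int fd;

            if (argc < 2)
                return 1;
            fd = open(argv[1], O_RDONLY);   /* e.g. a DASD node or a loop device over an image */
            if (fd < 0) {
                perror("open");
                return 1;
            }
            if (ioctl(fd, HDIO_GETGEO, &geo) < 0) {
                perror("HDIO_GETGEO");
                return 1;
            }
            printf("heads=%u sectors=%u cylinders=%u start=%lu\n",
                   (unsigned)geo.heads, (unsigned)geo.sectors,
                   (unsigned)geo.cylinders, geo.start);
            close(fd);
            return 0;
        }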
     

15 Sep, 2012

1 commit

  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies
    expecting completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
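
    A minimal sketch of the mechanism described above: the
    cgroup_subsys.broken_hierarchy field name comes from the patch, while the
    example subsystem, the check's placement and the warning text are
    illustrative only.

        /* A controller without proper nesting support marks itself: */
        struct cgroup_subsys example_subsys = {
            .name               = "example",
            /* ... create/destroy and other callbacks elided ... */
            .broken_hierarchy   = true,
        };

        /* Rough shape of the core check, run after a cgroup has been created:
         * whine only when the new cgroup is nested below a non-root cgroup. */
        static void warn_if_broken_hierarchy(struct cgroup *cgrp,
                                             struct cgroup_subsys *ss)
        {
            if (ss->broken_hierarchy && cgrp->parent && cgrp->parent->parent)
                pr_warning("cgroup: %s does not support nested cgroups yet; "
                           "behavior may change\n", ss->name);
        }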
     

31 Aug, 2012

1 commit

  • When performing a cable pull test with active stress I/O using fio over
    a dual port Intel 82599 FCoE CNA, with 256 LUNs on one port and about 32 LUNs
    on the other, it is observed that the system becomes unusable because
    scsi-ml is busy printing error messages for all the failing commands.
    I don't believe this problem is specific to FCoE, and these commands are
    failing anyway due to the link being down (DID_NO_CONNECT), so just
    rate-limit the messages here to solve this issue.

    v2->v1: use __ratelimit(), as Tomas Henzl mentioned, as the proper way to
    rate-limit per function. However, in this case, the failed I/O gets to
    blk_end_request_err() and then blk_update_request(), which also has to
    be rate-limited, as added in v2 of this patch.

    v3->v2: resolved conflict to apply on the current 3.6-rc3 upstream tip.

    Signed-off-by: Yi Zou
    Cc: www.Open-FCoE.org
    Cc: Tomas Henzl
    Cc:
    Signed-off-by: Jens Axboe

    Yi Zou
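
    The fix leans on the kernel's ratelimit helpers. A generic sketch of
    per-callsite rate limiting with __ratelimit() (the message and limits here
    are illustrative; the actual patch touches the scsi-ml and
    blk_update_request() error paths):

        #include <linux/ratelimit.h>
        #include <linux/printk.h>

        static void report_failed_io(int error)
        {
            /* one ratelimit state per call site: at most 10 messages per 5 seconds */
            static DEFINE_RATELIMIT_STATE(rs, 5 * HZ, 10);

            if (__ratelimit(&rs))
                printk(KERN_ERR "I/O failed with error %d\n", error);
        }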
     

22 Aug, 2012

2 commits

  • Now that cancel_delayed_work() can be safely called from IRQ handlers,
    there's no reason to use __cancel_delayed_work(). Use
    cancel_delayed_work() instead of __cancel_delayed_work() and mark the
    latter deprecated.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Jiri Kosina
    Cc: Roland Dreier
    Cc: Tomi Valkeinen

    Tejun Heo
     
  • Now that mod_delayed_work() is safe to call from IRQ handlers,
    __cancel_delayed_work() followed by queue_delayed_work() can be
    replaced with mod_delayed_work().

    Most conversions are straight-forward except for the following.

    * net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
    elaborate dance around its delayed_work. Collapse it such that
    linkwatch_work is queued for immediate execution if LW_URGENT and the
    existing timer is kept otherwise.

    Signed-off-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Tomi Valkeinen

    Tejun Heo
     

21 Aug, 2012

1 commit

  • system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
    convert all users to system[_freezable]_wq.

    If you're cc'd and wondering what's going on: Now all workqueues are
    non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
    Please use system[_freezable]_wq instead.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-By: Lai Jiangshan

    Cc: Jens Axboe
    Cc: David Airlie
    Cc: Jiri Kosina
    Cc: "David S. Miller"
    Cc: Rusty Russell
    Cc: "Paul E. McKenney"
    Cc: David Howells

    Tejun Heo
     

14 Aug, 2012

1 commit

  • Convert delayed_work users doing cancel_delayed_work() followed by
    queue_delayed_work() to mod_delayed_work().

    Most conversions are straight-forward. Ones worth mentioning are,

    * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
    use mod_delayed_work() and cancel loop in
    edac_mc_reset_delay_period() is dropped.

    * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
    watchdog is active or not. @fan_watchdog_active and related code
    dropped.

    * drivers/power/charger-manager.c: Seemingly a lot of
    delayed_work_pending() abuse going on here.
    [delayed_]work_pending() are unsynchronized and racy when used like
    this. I converted one instance in fullbatt_handler(). Please
    convert the rest so that it invokes workqueue APIs for the intended
    target state rather than trying to game work item pending state
    transitions. e.g. if the timer should be modified - call
    mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

    * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
    simplified. Note that the round_jiffies() calls in this function are
    meaningless. round_jiffies() works on absolute jiffies, not the relative
    delay used by delayed_work.

    v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work(). They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock. __cancel_delayed_work() users are
    dropped.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Acked-by: Dmitry Torokhov
    Acked-by: Anton Vorontsov
    Acked-by: David Howells
    Cc: Tomi Valkeinen
    Cc: Jens Axboe
    Cc: Jiri Kosina
    Cc: Doug Thompson
    Cc: David Airlie
    Cc: Roland Dreier
    Cc: "John W. Linville"
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: "J. Bruce Fields"
    Cc: Johannes Berg

    Tejun Heo
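
    The charger-manager note above describes a general anti-pattern. A hedged
    sketch of the difference (work item and names hypothetical, assumed to be
    initialized elsewhere with INIT_DELAYED_WORK()):

        #include <linux/workqueue.h>

        static struct delayed_work poll_work;   /* hypothetical work item */

        /* Racy: the pending bit can change between the test and the queueing,
         * so this games internal state instead of requesting a target state. */
        static void reschedule_racy(unsigned long delay)
        {
            if (delayed_work_pending(&poll_work))
                cancel_delayed_work(&poll_work);
            queue_delayed_work(system_wq, &poll_work, delay);
        }

        /* Intended-state style: just say what the timer should be. */
        static void reschedule_clean(unsigned long delay)
        {
            mod_delayed_work(system_wq, &poll_work, delay);
        }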
     

03 Aug, 2012

3 commits

  • I met an odd problem: reading /proc/partitions may return zero.

    I wrote a test program, test.c:

    int main()
    {
        char buff[4096];
        int ret;
        int fd;

        printf("pid=%d\n", getpid());
        while (1) {
            fd = open("/proc/partitions", O_RDONLY);
            if (fd < 0) {
                printf("open error %s\n", strerror(errno));
                return 0;
            }
            ret = read(fd, buff, 4096);
            if (ret <= 0) {
                printf("read returned %d\n", ret);
                break;
            }
            close(fd);
        }
        return 0;
    }

    and at the same time ran:
    1: while :; do cat /proc/partitions > /dev/null; done
    2: ./test

    I reviewed the code and found:

    >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
    >> {
    >> static void *p;
    >>
    >> p = disk_seqf_start(seqf, pos);
    >> if (!IS_ERR_OR_NULL(p) && !*pos)
    >> seq_puts(seqf, "major minor #blocks name\n\n");
    >> return p;
    >> }
    Because p is static, the two concurrent readers share it; the interleaving
    looks like:

    test                                 cat /proc/partitions
    p = disk_seqf_start()  (not NULL)
                                         p = disk_seqf_start()  (NULL because of its *pos)
    if (!IS_ERR_OR_NULL(p) && !*pos)
        -> p has been overwritten with NULL, so show_partition_start() returns
           NULL and the read returns zero.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
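
    The root cause is the function-static pointer shared by concurrent readers
    of /proc/partitions. The obvious shape of the fix (sketched here, not
    quoted from the final patch) is to make it an ordinary local again:

        static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
        {
            void *p;    /* per-call, so concurrent readers cannot clobber each other */

            p = disk_seqf_start(seqf, pos);
            if (!IS_ERR_OR_NULL(p) && !*pos)
                seq_puts(seqf, "major minor  #blocks  name\n\n");
            return p;
        }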
     
  • Add a helper to map a bio to a scatterlist, modelled after
    blk_rq_map_sg.

    This helper is useful for any driver that wants to create
    a scatterlist from its ->make_request_fn method.

    Changes in v2:
    - Use __blk_segment_map_sg to avoid duplicated code
    - Add docbook style function comment

    Cc: Rusty Russell
    Cc: Christoph Hellwig
    Cc: Tejun Heo
    Cc: Shaohua Li
    Cc: "Michael S. Tsirkin"
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: virtualization@lists.linux-foundation.org
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Minchan Kim
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
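
    A rough sketch of how a bio-based driver might use the new helper from its
    ->make_request_fn (the driver structure and segment limit are hypothetical;
    the helper fills the scatterlist from the bio and returns the number of
    entries used):

        #include <linux/blkdev.h>
        #include <linux/scatterlist.h>

        #define MY_MAX_SEGMENTS 16      /* hypothetical per-device limit */

        struct my_dev {
            /* preallocated; a real driver must serialize access to this table */
            struct scatterlist sg[MY_MAX_SEGMENTS];
        };

        static void my_make_request(struct request_queue *q, struct bio *bio)
        {
            struct my_dev *dev = q->queuedata;
            int nents;

            sg_init_table(dev->sg, MY_MAX_SEGMENTS);
            nents = blk_bio_map_sg(q, bio, dev->sg);

            /* ... hand dev->sg[0..nents-1] to the hardware or virtqueue ... */
        }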
     
  • Split the mapping code in blk_rq_map_sg() to a helper
    __blk_segment_map_sg(), so that other mapping function, e.g.
    blk_bio_map_sg(), can share the code.

    Cc: Rusty Russell
    Cc: Christoph Hellwig
    Cc: Tejun Heo
    Cc: Shaohua Li
    Cc: "Michael S. Tsirkin"
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: virtualization@lists.linux-foundation.org
    Suggested-by: Jens Axboe
    Suggested-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     

02 Aug, 2012

4 commits

  • When a disk has large discard_granularity and small max_discard_sectors,
    discards are not split with optimal alignment. In the limit case of
    discard_granularity == max_discard_sectors, no request could be aligned
    correctly, so in fact you might end up with no discarded logical blocks
    at all.

    Another example that helps show the condition in the patch is with
    discard_granularity == 64, max_discard_sectors == 128. A request that is
    submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
    However, only 2 aligned blocks out of 3 are included in the request;
    128..191 may be left intact and not discarded. With this patch, the
    first request will be truncated to ensure good alignment of what's left,
    and the split will be 2..127, 128..255, 256..257. The patch will also
    take into account the discard_alignment.

    At most one extra request will be introduced, because the first request
    will be reduced by at most granularity-1 sectors, and granularity
    must be less than max_discard_sectors. Subsequent requests will run
    on round_down(max_discard_sectors, granularity) sectors, as in the
    current code.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
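
    The 64/128 example from the text can be checked with a little standalone
    arithmetic. This sketch (discard_alignment assumed to be 0) reproduces the
    2..127, 128..255, 256..257 split by truncating a full-size chunk so that
    what remains starts on a granularity boundary:

        #include <stdio.h>

        int main(void)
        {
            unsigned long long start = 2, nr = 256;     /* sectors 2..257 */
            unsigned int granularity = 64, max_discard = 128;

            while (nr) {
                unsigned long long req = nr < max_discard ? nr : max_discard;
                unsigned long long end = start + req;

                /* truncate so the remainder begins on an aligned block */
                if (req == max_discard && (end % granularity))
                    req -= end % granularity;

                printf("discard %llu..%llu (%llu sectors)\n",
                       start, start + req - 1, req);
                start += req;
                nr -= req;
            }
            return 0;
        }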
     
  • Mostly a preparation for the next patch.

    In principle this fixes an infinite loop if max_discard_sectors < granularity,
    but that really shouldn't happen.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     
  • Pull block driver changes from Jens Axboe:

    - Making the plugging support for drivers a bit more sane from Neil.
    This supersedes the plugging change from Shaohua as well.

    - The usual round of drbd updates.

    - Using a tail add instead of a head add in the request completion for
    nbd, making us find the most completed request more quickly.

    - A few floppy changes, getting rid of a duplicated flag and also
    running the floppy init async (since it takes forever in boot terms)
    from Andi.

    * 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
    floppy: remove duplicated flag FD_RAW_NEED_DISK
    blk: pass from_schedule to non-request unplug functions.
    block: stack unplug
    blk: centralize non-request unplug handling.
    md: remove plug_cnt feature of plugging.
    block/nbd: micro-optimization in nbd request completion
    drbd: announce FLUSH/FUA capability to upper layers
    drbd: fix max_bio_size to be unsigned
    drbd: flush drbd work queue before invalidate/invalidate remote
    drbd: fix potential access after free
    drbd: call local-io-error handler early
    drbd: do not reset rs_pending_cnt too early
    drbd: reset congestion information before reporting it in /proc/drbd
    drbd: report congestion if we are waiting for some userland callback
    drbd: differentiate between normal and forced detach
    drbd: cleanup, remove two unused global flags
    floppy: Run floppy initialization asynchronous

    Linus Torvalds
     
  • Pull core block IO bits from Jens Axboe:
    "The most complicated part if this is the request allocation rework by
    Tejun, which has been queued up for a long time and has been in
    for-next ditto as well.

    There are a few commits from yesterday and today, mostly trivial and
    obvious fixes. So I'm pretty confident that it is sound. It's also
    smaller than usual."

    * 'for-3.6/core' of git://git.kernel.dk/linux-block:
    block: remove dead func declaration
    block: add partition resize function to blkpg ioctl
    block: uninitialized ioc->nr_tasks triggers WARN_ON
    block: do not artificially constrain max_sectors for stacking drivers
    blkcg: implement per-blkg request allocation
    block: prepare for multiple request_lists
    block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
    blkcg: inline bio_blkcg() and friends
    block: allocate io_context upfront
    block: refactor get_request[_wait]()
    block: drop custom queue draining used by scsi_transport_{iscsi|fc}
    mempool: add @gfp_mask to mempool_create_node()
    blkcg: make root blkcg allocation use %GFP_KERNEL
    blkcg: __blkg_lookup_create() doesn't need radix preload

    Linus Torvalds
     

01 Aug, 2012

4 commits

  • The __generic_unplug_device() function was removed in commit
    7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the
    declaration at the same time. Remove it here.

    Signed-off-by: Yuanhan Liu
    Signed-off-by: Jens Axboe

    Yuanhan Liu
     
  • Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
    allows altering the size of an existing partition, even if it is currently
    in use.

    This patch converts hd_struct->nr_sects into a sequence counter, because
    one might extend a partition while I/O is happening to it, and the update
    of nr_sects can be non-atomic on 32-bit machines with a 64-bit sector_t.
    This can lead to issues like reading an inconsistent size of a partition.
    A sequence counter has been used so that readers don't have to take the
    bdev mutex lock, as we call sector_in_part() very frequently.

    Now all the access to hd_struct->nr_sects should happen using sequence
    counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
    There is one exception though: set_capacity()/get_capacity(). I think
    theoretically the race should exist there too, but this patch does not
    modify set_capacity()/get_capacity() due to the sheer number of call sites,
    and I am afraid that change might break something. I have left that as a
    TODO item. We can handle it later if need be. This patch does not introduce
    any new races as such w.r.t. set_capacity()/get_capacity().

    v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Phillip Susi
    Signed-off-by: Jens Axboe

    Vivek Goyal
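
    The read side of a sequence counter simply retries if a writer interleaved.
    A generic sketch of the pattern (not the exact part_nr_sects_read() source;
    the kernel only needs it where a 64-bit nr_sects can tear, i.e. 32-bit SMP):

        #include <linux/seqlock.h>
        #include <linux/types.h>

        struct sized_thing {                    /* illustrative pairing only */
            seqcount_t  seq;
            sector_t    nr_sects;
        };

        static sector_t read_size(struct sized_thing *t)
        {
            sector_t sects;
            unsigned int start;

            do {
                start = read_seqcount_begin(&t->seq);
                sects = t->nr_sects;
            } while (read_seqcount_retry(&t->seq, start));

            return sects;
        }

        static void write_size(struct sized_thing *t, sector_t size)
        {
            write_seqcount_begin(&t->seq);      /* writers must already be serialized */
            t->nr_sects = size;
            write_seqcount_end(&t->seq);
        }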
     
  • Hi,

    I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
    below warning as of 3.5-rc1 and later (3.4 is fine):

    [ 10.886893] ------------[ cut here ]------------
    [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
    [ 10.886905] Hardware name: Bochs
    [ 10.886906] Modules linked in:
    [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
    [ 10.886908] Call Trace:
    [ 10.886911] [] warn_slowpath_common+0x7a/0xb0
    [ 10.886912] [] warn_slowpath_null+0x15/0x20
    [ 10.886913] [] copy_process+0x1488/0x1560
    [ 10.886914] [] do_fork+0xb4/0x340
    [ 10.886918] [] ? recalc_sigpending+0x1a/0x50
    [ 10.886919] [] ? __set_task_blocked+0x32/0x80
    [ 10.886920] [] ? __set_current_blocked+0x3a/0x60
    [ 10.886923] [] sys_clone+0x23/0x30
    [ 10.886925] [] stub_clone+0x13/0x20
    [ 10.886927] [] ? system_call_fastpath+0x16/0x1b
    [ 10.886928] ---[ end trace 32a14af7ee6a590b ]---

    Reproducing is easy, I can hit it on a KVM system with a very basic
    config (x86_64 make defconfig + enable the drivers needed). To hit it,
    just install dump (on debian/ubuntu, not sure what the package might be
    called on Fedora), and:

    dump -o -f /tmp/foo /

    You'll see the warning in dmesg once it forks off the I/O process and
    starts dumping filesystem contents.

    I bisected it down to the following commit:

    commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
    Author: Tejun Heo
    Date: Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks. This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task. This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    It seems like the init of ioc->nr_tasks was removed in that patch,
    so it starts out at 0 instead of 1.

    Tejun, is the right thing here to add back the init, or should something else
    be done?

    The below patch removes the warning, but I haven't done any more extensive
    testing on it.

    Signed-off-by: Olof Johansson
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Olof Johansson
     
  • blk_set_stacking_limits is intended to allow stacking drivers to build
    up the limits of the stacked device based on the underlying devices'
    limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
    doesn't allow the stacking driver to inherit a max_sectors larger than
    1024 -- due to blk_stack_limits' use of min_not_zero.

    It is now clear that this artificial limit is getting in the way so
    change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
    stacking drivers like dm-multipath to inherit 'max_sectors' from the
    underlying paths).

    Reported-by: Vijay Chauhan
    Tested-by: Vijay Chauhan
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
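
    The cap is easiest to see from min_not_zero() itself: whatever non-zero
    default the stacking queue starts with becomes an upper bound on what it
    can inherit. A toy userspace illustration (min_not_zero restated locally
    with the same semantics as the kernel macro):

        #include <stdio.h>

        #define min_not_zero(a, b) \
            ((a) == 0 ? (b) : ((b) == 0 ? (a) : ((a) < (b) ? (a) : (b))))

        int main(void)
        {
            unsigned int underlying = 4096;   /* e.g. a path below dm-multipath */

            /* old stacking default: BLK_DEF_MAX_SECTORS == 1024 */
            unsigned int old_result = min_not_zero(1024u, underlying);
            /* new stacking default: UINT_MAX, so the underlying limit wins */
            unsigned int new_result = min_not_zero(~0u, underlying);

            printf("old default caps at %u, new default inherits %u\n",
                   old_result, new_result);
            return 0;
        }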
     

31 Jul, 2012

3 commits

  • This will allow md/raid to know why the unplug was called,
    and it will be able to act accordingly - if !from_schedule it
    is safe to perform tasks which could themselves schedule.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • MD raid1 prepares to dispatch requests in its unplug callback. If
    make_request in a lower-level queue also uses an unplug callback to
    dispatch requests, the lower-level queue's unplug callback will not be
    called. Rechecking the callback list helps this case.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Both md and umem have similar code for getting notified of a
    blk_finish_plug event.
    Centralize this code in block/ and allow each driver to
    provide its distinctive difference.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     

20 Jul, 2012

1 commit

  • If the queue is dead, blk_execute_rq_nowait() doesn't invoke the done()
    callback function. That will result in blk_execute_rq() being stuck
    in wait_for_completion(). Avoid this by initializing rq->end_io to the
    done() callback before we check the queue state. Also, make sure the
    queue lock is held around the invocation of the done() callback. Found
    this through source code review.

    Signed-off-by: Muthukumar Ratty
    Signed-off-by: Bart Van Assche
    Reviewed-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: James Bottomley

    Muthukumar Ratty
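
    A hedged sketch of the ordering the fix describes: set rq->end_io before
    the dead-queue check, and invoke done() under the queue lock when bailing
    out (shape only, not the verbatim patch):

        #include <linux/blkdev.h>
        #include <linux/elevator.h>

        static void execute_nowait_sketch(struct request_queue *q,
                                          struct gendisk *bd_disk,
                                          struct request *rq, int at_head,
                                          rq_end_io_fn *done)
        {
            rq->rq_disk = bd_disk;
            rq->end_io = done;                  /* set before checking queue state */

            spin_lock_irq(q->queue_lock);
            if (unlikely(blk_queue_dead(q))) {
                rq->errors = -ENXIO;
                if (rq->end_io)
                    rq->end_io(rq, rq->errors); /* done() still runs, under the lock */
                spin_unlock_irq(q->queue_lock);
                return;
            }
            __elv_add_request(q, rq, at_head ?
                              ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK);
            __blk_run_queue(q);
            spin_unlock_irq(q->queue_lock);
        }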
     

27 Jun, 2012

1 commit

  • Currently, request_queue has one request_list to allocate requests
    from regardless of blkcg of the IO being issued. When the unified
    request pool is used up, cfq proportional IO limits become meaningless
    - whoever grabs the next request being freed wins the race regardless
    of the configured weights.

    This can be easily demonstrated by creating a blkio cgroup with a very low
    weight, putting a program which can issue a lot of random direct IOs there,
    and running a sequential IO from a different cgroup. As soon as the
    request pool is used up, the sequential IO bandwidth crashes.

    This patch implements per-blkg request_list. Each blkg has its own
    request_list and any IO allocates its request from the matching blkg
    making blkcgs completely isolated in terms of request allocation.

    * Root blkcg uses the request_list embedded in each request_queue,
    which was renamed to @q->root_rl from @q->rq. While making blkcg rl
    handling a bit hairier, this enables avoiding most overhead for root
    blkcg.

    * Queue fullness is properly per request_list but bdi isn't blkcg
    aware yet, so congestion state currently just follows the root
    blkcg. As writeback isn't aware of blkcg yet, this works okay for
    async congestion but readahead may get the wrong signals. It's
    better than blkcg completely collapsing with shared request_list but
    needs to be improved with future changes.

    * After this change, each block cgroup gets a full request pool making
    resource consumption of each cgroup higher. This makes allowing
    non-root users to create cgroups less desirable; however, note that
    allowing non-root users to directly manage cgroups is already
    severely broken regardless of this patch - each block cgroup
    consumes kernel memory and skews IO weight (IO weights are not
    hierarchical).

    v2: queue-sysfs.txt updated and patch description updated as suggested
    by Vivek.

    v3: blk_get_rl() wasn't checking error return from
    blkg_lookup_create() and may cause oops on lookup failure. Fix it
    by falling back to root_rl on blkg lookup failures. This problem
    was spotted by Rakesh Iyer.

    v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
    request waitqueue". blk_drain_queue() now wakes up waiters on all
    blkg->rl on the target queue.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Wu Fengguang
    Signed-off-by: Jens Axboe

    Tejun Heo
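
    A rough sketch of the request_list selection this describes, including the
    v3 fallback to root_rl; the helper here is paraphrased and simplified, not
    the kernel's actual blk_get_rl() (which also handles reference counting and
    RCU):

        /* pick the request_list for a bio; lose isolation, not correctness,
         * if the blkg cannot be found */
        static struct request_list *choose_request_list(struct request_queue *q,
                                                        struct bio *bio)
        {
            struct blkcg *blkcg = bio_blkcg(bio);
            struct blkcg_gq *blkg;

            if (blkcg == &blkcg_root)
                return &q->root_rl;     /* root blkcg uses the queue-embedded list */

            blkg = blkg_lookup(blkcg, q);
            if (!blkg)
                return &q->root_rl;     /* lookup/creation failed: fall back */
            return &blkg->rl;
        }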
     

25 Jun, 2012

9 commits

  • Request allocation is about to be made per-blkg meaning that there'll
    be multiple request lists.

    * Make queue full state per request_list. blk_*queue_full() functions
    are renamed to blk_*rl_full() and takes @rl instead of @q.

    * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
    instead of @q. Also add @gfp_mask parameter.

    * Add blk_exit_rl() instead of destroying rl directly from
    blk_release_queue().

    * Add request_list->q and make request alloc/free functions -
    blk_free_request(), [__]freed_request(), __get_request() - take @rl
    instead of @q.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
    move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated
    to use q->nr_rqs[] instead of q->rq.count[].

    These counters separate queue-wide request statistics from the
    request list and allow implementation of per-queue request allocation.

    While at it, properly indent fields of struct request_list.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Make bio_blkcg() and friends inline. They all are very simple and
    used only in a few places.

    This patch is to prepare for further updates to request allocation
    path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The block layer does very lazy allocation of ioc. It waits until the
    moment ioc is absolutely necessary; unfortunately, that time could be
    inside the queue lock, and __get_request() performs an unlock - try
    alloc - retry dance.

    Just allocate it up-front on entry to block layer. We're not saving
    the rain forest by deferring it to the last possible moment and
    complicating things unnecessarily.

    This patch is to prepare for further updates to request allocation
    path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, there are two request allocation functions - get_request()
    and get_request_wait(). The former tries to allocate a request once;
    the latter wraps the former and keeps retrying until allocation
    succeeds.

    The combination of two functions deliver fallible non-wait allocation,
    fallible wait allocation and unfailing wait allocation. However,
    given that forward progress is guaranteed, fallible wait allocation
    isn't all that useful and in fact nobody uses it.

    This patch simplifies the interface as follows.

    * get_request() is renamed to __get_request() and is only used by the
    wrapper function.

    * get_request_wait() is renamed to get_request(). It now takes
    @gfp_mask and retries iff it contains %__GFP_WAIT.

    This patch doesn't introduce any functional change and is to prepare
    for further updates to request allocation path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • iscsi_remove_host() uses bsg_remove_queue() which implements custom
    queue draining. fc_bsg_remove() open-codes mostly identical logic.

    The draining logic isn't correct in that blk_stop_queue() doesn't
    prevent new requests from being queued - it just stops processing - so
    nothing prevents new requests from being queued after the logic determines
    that the queue is drained.

    blk_cleanup_queue() now implements proper queue draining and these
    custom draining logics aren't necessary. Drop them and use
    bsg_unregister_queue() + blk_cleanup_queue() instead.

    Signed-off-by: Tejun Heo
    Reviewed-by: Mike Christie
    Acked-by: Vivek Goyal
    Cc: James Bottomley
    Cc: James Smart
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • mempool_create_node() currently assumes %GFP_KERNEL. Its only user,
    blk_init_free_list(), is about to be updated to use other allocation
    flags - add @gfp_mask argument to the function.

    Signed-off-by: Tejun Heo
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Jens Axboe

    Tejun Heo
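
    With the extra argument, a caller sketch (pool size and cache hypothetical)
    looks like:

        #include <linux/mempool.h>
        #include <linux/slab.h>

        /* a small object pool bound to a NUMA node, now created with an
         * explicit gfp_mask instead of an implied GFP_KERNEL */
        static mempool_t *create_my_pool(struct kmem_cache *cachep, int node)
        {
            return mempool_create_node(4 /* min_nr */,
                                       mempool_alloc_slab, mempool_free_slab,
                                       cachep, GFP_KERNEL, node);
        }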
     
  • Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
    from __blkg_lookup_create() for root blkcg creation. This could make
    policy activation fail unnecessarily.

    Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
    optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
    preload radix tree and preallocate blkg with %GFP_KERNEL before trying
    to create the root blkg.

    v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
    instead of ERR_PTR() value. Fixed.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There's no point in calling radix_tree_preload() if preloading doesn't
    use a more permissive GFP mask. Drop preloading from
    __blkg_lookup_create().

    While at it, drop sparse locking annotation which no longer applies.

    v2: Vivek pointed out the odd preload usage. Instead of updating,
    just drop it.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jun, 2012

4 commits

  • Sometimes, warnings about ioctls to a partition happen often enough that
    they form the majority of the warnings in the kernel log and users complain.
    In some cases the warnings are about ioctls such as SG_IO, so it's not good
    to get rid of the warnings completely, as they can ease debugging of
    userspace problems when an ioctl is refused.

    Since I have seen warnings from lots of commands, including some proprietary
    userspace applications, I don't think disallowing the ioctls for processes
    with CAP_SYS_RAWIO will happen in the near future, if ever. So let's just
    stop warning for processes with CAP_SYS_RAWIO, for which the ioctl is allowed.

    CC: Paolo Bonzini
    CC: James Bottomley
    CC: linux-scsi@vger.kernel.org
    Acked-by: Paolo Bonzini
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
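
    The shape of the change: keep the (rate-limited) warning, but skip it when
    the caller has CAP_SYS_RAWIO, since the ioctl will be allowed for such
    processes anyway (a sketch, not the verbatim patch):

        #include <linux/capability.h>
        #include <linux/printk.h>
        #include <linux/sched.h>

        static void maybe_warn_partition_ioctl(unsigned int cmd)
        {
            if (capable(CAP_SYS_RAWIO))
                return;                 /* allowed anyway, don't spam the log */

            printk_ratelimited(KERN_WARNING
                               "%s: sending ioctl %x to a partition!\n",
                               current->comm, cmd);
        }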
     
  • This function was only used by btrfs code in btrfs_abort_devices()
    (seemingly in a wrong way).

    It was removed in commit d07eb9117050c9ed3f78296ebcc06128b52693be,
    so let's remove the dead code to avoid any confusion.

    Changes in v2: update commit log, btrfs_abort_devices() was removed
    already.

    Cc: Jens Axboe
    Cc: linux-kernel@vger.kernel.org
    Cc: Chris Mason
    Cc: linux-btrfs@vger.kernel.org
    Cc: David Sterba
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • Commit 777eb1bf15b8532c396821774bf6451e563438f5 disconnects the externally
    supplied queue_lock before blk_drain_queue(). Switching the lock would
    introduce a lock imbalance, because threads which have taken the external
    lock might unlock the internal lock during the queue drain. This patch
    mitigates this by disconnecting the lock after the queue draining, since
    queue draining makes a lot of request_queue users go away.

    However, please note that this patch only makes the problem less likely to
    happen. Anyone who still holds a ref might try to issue a new request on
    a dead queue after blk_cleanup_queue() finishes draining, and the lock
    imbalance might still happen in this case.

    =====================================
    [ BUG: bad unlock balance detected! ]
    3.4.0+ #288 Not tainted
    -------------------------------------
    fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
    [] blk_queue_bio+0x2a2/0x380
    but there are no more locks to release!

    other info that might help us debug this:
    1 lock held by fio/17706:
    #0: (&(&vblk->lock)->rlock){......}, at: []
    get_request_wait+0x19a/0x250

    stack backtrace:
    Pid: 17706, comm: fio Not tainted 3.4.0+ #288
    Call Trace:
    [] ? blk_queue_bio+0x2a2/0x380
    [] print_unlock_inbalance_bug+0xf9/0x100
    [] lock_release_non_nested+0x1df/0x330
    [] ? dio_bio_end_aio+0x34/0xc0
    [] ? bio_check_pages_dirty+0x85/0xe0
    [] ? dio_bio_end_aio+0xb1/0xc0
    [] ? blk_queue_bio+0x2a2/0x380
    [] ? blk_queue_bio+0x2a2/0x380
    [] lock_release+0xd9/0x250
    [] _raw_spin_unlock_irq+0x23/0x40
    [] blk_queue_bio+0x2a2/0x380
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x76/0xf0
    [] ? set_page_dirty_lock+0x3c/0x60
    [] ? bio_set_pages_dirty+0x51/0x70
    [] do_blockdev_direct_IO+0xbf8/0xee0
    [] ? blkdev_get_block+0x80/0x80
    [] __blockdev_direct_IO+0x55/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] blkdev_direct_IO+0x57/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] generic_file_aio_read+0x70e/0x760
    [] ? __lock_acquire+0x215/0x5a0
    [] ? aio_run_iocb+0x54/0x1a0
    [] ? grab_cache_page_nowait+0xc0/0xc0
    [] aio_rw_vect_retry+0x7c/0x1e0
    [] ? aio_fsync+0x30/0x30
    [] aio_run_iocb+0x66/0x1a0
    [] do_io_submit+0x6f0/0xb80
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] sys_io_submit+0x10/0x20
    [] system_call_fastpath+0x16/0x1b

    Changes since v2: Update commit log to explain how the code is still
    broken even if we delay the lock switching after the drain.
    Changes since v1: Update commit log as Tejun suggested.

    Acked-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • After hot-unplugging a stressed disk, I found that rl->wait[] is not empty
    while rl->count[] is empty and there are threads still sleeping on
    get_request after the queue cleanup. With simple debug code, I found
    there are exactly nr_sleep - nr_wakeup threads in D state. So there
    are missed wakeups.

    $ dmesg | grep nr_sleep
    [ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
    $ vmstat 1
    1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0

    To quote Tejun:

    Ah, okay, freed_request() wakes up single waiter with the assumption
    that after the wakeup there will at least be one successful allocation
    which in turn will continue the wakeup chain until the wait list is
    empty - ie. waiter wakeup is dependent on successful request
    allocation happening after each wakeup. With queue marked dead, any
    woken up waiter fails the allocation path, so the wakeup chaining is
    lost and we're left with hung waiters. What we need is wake_up_all()
    after drain completion.

    This patch fixes the missed wakeups by waking up all the threads which
    are sleeping on the wait queue after queue drain.

    Changes in v2: Drop waitqueue_active() optimization

    Acked-by: Tejun Heo
    Signed-off-by: Asias He

    Fixed a bug by me, where stacked devices would oops on calling
    blk_drain_queue() since ->rq.wait[] does not get initialized unless
    it's a full queue setup.

    Signed-off-by: Jens Axboe

    Asias He
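
    A hedged sketch of the fix's core idea: once draining has finished, wake
    every thread sleeping on the request list's wait queues instead of relying
    on the one-wakeup-per-freed-request chain (field names follow the
    ->rq.wait[] layout mentioned in the message):

        #include <linux/blkdev.h>
        #include <linux/wait.h>

        /* called at the end of queue draining on a request-based queue */
        static void wake_all_request_waiters(struct request_queue *q)
        {
            struct request_list *rl = &q->rq;
            int i;

            for (i = 0; i < ARRAY_SIZE(rl->wait); i++)
                wake_up_all(&rl->wait[i]);      /* BLK_RW_SYNC and BLK_RW_ASYNC */
        }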
     

06 Jun, 2012

1 commit

  • blkg_destroy() caches @blkg->q in the local variable @q. While there are
    two places which need @blkg->q, only lockdep_assert_held() used the
    local variable, leading to an unused-variable warning if lockdep is
    configured out. Drop the local variable and just use @blkg->q
    directly.

    Signed-off-by: Tejun Heo
    Reported-by: Rakesh Iyer
    Signed-off-by: Jens Axboe

    Tejun Heo