28 May, 2011

1 commit

  • Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    loop: export module parameters
    block: export blk_{get,put}_queue()
    block: remove unused variable in bio_attempt_front_merge()
    block: always allocate genhd->ev if check_events is implemented
    brd: export module parameters
    brd: fix comment on initial device creation
    brd: handle on-demand devices correctly
    brd: limit 'max_part' module param to DISK_MAX_PARTS
    brd: get rid of unused members from struct brd_device
    block: fix oops on !disk->queue and sysfs discard alignment display

    Linus Torvalds
     

27 May, 2011

4 commits

  • We need blk_get_queue() and blk_put_queue() in SCSI to fix a bug, but
    currently they are not exported to modules. Export them.
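
    A minimal sketch of what the export amounts to (exact placement in
    block/blk-core.c is assumed):

        /* Sketch: make the queue refcount helpers visible to modular
         * drivers such as SCSI. */
        #include <linux/module.h>

        EXPORT_SYMBOL(blk_get_queue);
        EXPORT_SYMBOL(blk_put_queue);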

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for the cgroup subsystem interface. Unlike can_attach and attach, these
    are per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, replaced by this
    one. All subsystems are modified for the new interface; of note is
    cpuset, which requires the from/to nodemasks for attach to be globally
    scoped (though per-cpuset would work too) so that they persist from its
    pre_attach to attach_task and attach.

    This is a pre-patch for cgroup-procs-writable.patch.
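
    A sketch of the per-thread hooks added to the subsystem interface
    (signatures assumed from the changelog; the surrounding members of
    struct cgroup_subsys are elided):

        /* include/linux/cgroup.h (sketch) */
        struct cgroup_subsys {
                /* ... existing whole-group can_attach/attach callbacks ... */
                int  (*can_attach_task)(struct cgroup *cgrp,
                                        struct task_struct *tsk);
                void (*pre_attach)(struct cgroup *cgrp);
                void (*attach_task)(struct cgroup *cgrp,
                                    struct task_struct *tsk);
                /* ... */
        };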

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • The sector variable is never read inside bio_attempt_front_merge(), so
    remove it.

    Signed-off-by: Luca Tettamanti
    Signed-off-by: Jens Axboe

    Luca Tettamanti
     
  • 9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe
    drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block
    drivers which have an inadequate ->check_events(). Combined with the
    earlier change 7c88a168da (block: don't propagate unlisted DISK_EVENTs
    to userland), this enables using ->check_events() for internal
    processing while avoiding in-kernel block event polling, which can lead
    to an infinite event loop.

    Unfortunately, this left many drivers, including floppy, without any
    bit set in disk->events and ->async_events, in which case
    disk_add_events() simply skipped the allocation of disk->ev, disabling
    event handling entirely. As ->check_events() is still used for
    revalidation during open processing, this can lead to open failures.

    This patch always allocates disk->ev if ->check_events() is
    implemented. In the long term, it would make sense to simply embed the
    event structure in genhd, as it is now used by virtually all block
    devices.
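
    A sketch of the intended decision in disk_add_events() (surrounding
    code and the exact allocation details are assumed):

        /* Sketch: allocate disk->ev whenever the driver implements
         * ->check_events(), even if no bits are set in disk->events or
         * disk->async_events. */
        struct disk_events *ev;

        if (!disk->fops->check_events)
                return;

        ev = kzalloc(sizeof(*ev), GFP_KERNEL);
        if (!ev)
                return;
        /* ... initialise mutex, poll interval and delayed work ... */
        disk->ev = ev;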

    Signed-off-by: Tejun Heo
    Reported-by: Ondrej Zary
    Reported-by: Alex Villacis Lasso
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     


21 May, 2011

16 commits

  • We don't need them anymore, so kill:

    - REQ_ON_PLUG checks in various places
    - !rq_mergeable() check in plug merging

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This patch merges in a fix that missed 2.6.39 final.

    Conflicts:
    block/blk.h

    Jens Axboe
     
  • Currently we take the queue lock on each bio to check whether there are
    any throttling rules associated with the group and also to update the
    stats. Now access the group under rcu and update the stats without
    taking the queue lock. The queue lock is taken only if there are
    throttling rules associated with the group.

    So in the common case of the root group, when there are no rules, we
    save the unnecessary pounding of the request queue lock.
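
    A sketch of the resulting fast path (helper and field names here are
    illustrative, not the exact blk-throttle code):

        /* Sketch: charge a bio without q->queue_lock in the common case. */
        rcu_read_lock();
        blkcg = task_blkio_cgroup(current);
        tg = throtl_lookup_tg(td, blkcg);       /* illustrative helper */
        if (tg && !tg->has_rules) {
                /* no limits configured: bump per-cpu stats, no queue lock */
                throtl_update_dispatch_stats(tg, bio->bi_size, bio->bi_rw);
                rcu_read_unlock();
                return false;                   /* bio is not throttled */
        }
        rcu_read_unlock();

        /* rules exist: fall back to the queue lock for full processing */
        spin_lock_irq(q->queue_lock);
        /* ... */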

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Now the dispatch stats update is lock free. But resetting these stats
    still takes blkg->stats_lock and depends on it. As the stats are per
    cpu, we should be able to just reset the stats on each cpu without any
    locks (at least on 64bit archs).

    On 32bit archs there is a small race where 64bit updates are not
    atomic. The result of this race can be that, in the presence of other
    writers, one might not get a 0 value after resetting a stat and might
    see something intermediate.

    One could write more complicated code to cover this race, like sending
    an IPI to the other cpus to reset the stats and resetting them directly
    for offline cpus.

    Right now I am not taking that path because resetting stats is more of
    a debug feature, the race can happen only on 32bit archs, and the
    possibility of it happening is small. Will fix it if it becomes a real
    problem. For the time being, going for code simplicity.
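
    A minimal sketch of the lockless reset (structure and field names are
    assumed; the acknowledged 32bit race is left as is):

        /* Sketch: clear each cpu's copy directly.  On 64bit archs the
         * stores are atomic, so no lock is needed. */
        int cpu;

        for_each_possible_cpu(cpu) {
                struct blkio_group_stats_cpu *sc =
                                per_cpu_ptr(blkg->stats_cpu, cpu);

                memset(sc, 0, sizeof(*sc));
        }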

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Some of the stats are 64bit and updating them is non-atomic on 32bit
    architectures. Use sequence counters on 32bit archs to make reading of
    the stats safe.
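
    One standard way to do this is the generic u64_stats_sync helpers,
    which compile down to a seqcount on 32bit and to nothing on 64bit;
    whether the patch uses these or an open-coded seqcount, and the
    structure names below, are assumptions:

        #include <linux/u64_stats_sync.h>

        struct tg_stat_cpu {                    /* illustrative */
                u64                     sectors;
                struct u64_stats_sync   syncp;
        };

        /* writer side: runs on the local cpu, no lock needed */
        u64_stats_update_begin(&sc->syncp);
        sc->sectors += nr_sectors;
        u64_stats_update_end(&sc->syncp);

        /* reader side: retry if a 32bit update was in progress */
        unsigned int start;
        u64 val;

        do {
                start = u64_stats_fetch_begin(&sc->syncp);
                val = sc->sectors;
        } while (u64_stats_fetch_retry(&sc->syncp, start));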

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Currently we take the blkg_stat lock even just to update the stats. So
    even if a group has no throttling rules (the common case for the root
    group), we end up taking blkg_lock to update the stats.

    Make the dispatch stats per cpu so that they can be updated without
    taking the blkg lock.

    If a cpu goes offline, these stats simply disappear. No protection has
    been provided for that yet. Do we really need anything for that?
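
    A sketch of the per-cpu allocation and lock-free update (names
    illustrative):

        /* one copy of the dispatch stats per cpu, per group */
        struct tg_dispatch_cpu {                /* illustrative */
                u64 serviced;
                u64 service_bytes;
        };

        /* allocation, once per group */
        tg->stats_cpu = alloc_percpu(struct tg_dispatch_cpu);

        /* update path: touch only this cpu's copy, no blkg lock */
        struct tg_dispatch_cpu *sc;
        unsigned long flags;

        local_irq_save(flags);
        sc = this_cpu_ptr(tg->stats_cpu);
        sc->serviced++;
        sc->service_bytes += bytes;
        local_irq_restore(flags);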

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence
    start freeing up throtl_grp after one rcu grace period.
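
    A sketch of deferring the free by one grace period (the rcu_head
    member and callback name are assumed):

        /* struct throtl_grp gains an rcu_head ... */
        struct rcu_head rcu_head;

        static void throtl_free_tg(struct rcu_head *head)
        {
                struct throtl_grp *tg;

                tg = container_of(head, struct throtl_grp, rcu_head);
                kfree(tg);
        }

        /* ... and the final kfree(tg) becomes: */
        call_rcu(&tg->rcu_head, throtl_free_tg);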

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Use the same helper function for the root group as we use with
    dynamically allocated groups to add it to the various lists.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Add a helper function for code that is used in 2-3 places. Makes
    reading the code a little easier.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Currently, we allocate the root throtl_grp statically. But as we will
    be introducing per cpu stat pointers that will be allocated dynamically
    even for the root group, we might as well make the whole root
    throtl_grp allocation dynamic and treat it in the same manner as the
    other groups.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Currently, all the cfq_group or throtl_group allocations happen while
    we are holding ->queue_lock, and sleeping is not allowed.

    Soon, we will move to per cpu stats and will also need to allocate the
    per group stats. As one cannot call alloc_percpu() from atomic context
    because it can sleep, we need to drop ->queue_lock, allocate the group,
    retake the lock and continue processing.

    In the throttling code, I check the queue DEAD flag again to make sure
    that the driver did not call blk_cleanup_queue() in the meantime.
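
    A sketch of the drop/allocate/retake dance (names and the exact DEAD
    flag test are assumed):

        /* Sketch: group lookup runs under q->queue_lock, but
         * alloc_percpu() may sleep, so drop the lock around allocation. */
        spin_unlock_irq(q->queue_lock);

        tg = kzalloc_node(sizeof(*tg), GFP_KERNEL, q->node);
        if (tg)
                tg->stats_cpu = alloc_percpu(struct tg_dispatch_cpu);

        spin_lock_irq(q->queue_lock);

        /* the driver may have run blk_cleanup_queue() while we slept */
        if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
                /* free tg and its per-cpu stats, bail out */
                return NULL;
        }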

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • blkg->key = cfqd is an rcu protected pointer and hence we used to do
    call_rcu(cfqd->rcu_head) to free up the cfqd after one rcu grace
    period.

    The problem here is that even though the cfqd is around, there are no
    guarantees that the associated request queue (td->queue) or
    q->queue_lock is still around. A driver might have called
    blk_cleanup_queue() and released the lock.

    It might happen that after the lock has been freed we dereference
    blkg->key->queue->queue_lock and crash. This is possible in the
    following path:

    blkiocg_destroy()
    blkio_unlink_group_fn()
    cfq_unlink_blkio_group()

    Hence, wait for an rcu period if there are groups which have not been
    unlinked from blkcg->blkg_list. That way, any groups which are taking
    the cfq_unlink_blkio_group() path can safely take the queue lock.

    This is how we have taken care of the race in the throttling logic as
    well.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Nobody seems to be using cfq_find_alloc_cfqg() function parameter "create".
    Get rid of that.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • The cgroup unaccounted_time file is created only if
    CONFIG_DEBUG_BLK_CGROUP=y, but some of its fields are outside this
    config option. Fix that.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Group initialization code exists in two places: root group
    initialization in blk_throtl_init() and dynamically allocated groups in
    throtl_find_alloc_tg(). Create a common function and use it in both
    places.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
    had churn in the core area that makes it difficult to handle
    patches for e.g. cfq or blk-throttle. Instead of requiring that they
    be based on older versions with bugs that have been fixed later
    in the rc cycle, merge in 2.6.39 final.

    Also fix up the conflicts in the files below.

    Conflicts:
    drivers/block/paride/pcd.c
    drivers/cdrom/viocd.c
    drivers/ide/ide-cd.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 May, 2011

1 commit

  • blk_cleanup_queue() calls elevator_exit() and after this, we can't
    touch the elevator without oopsing. __elv_next_request() must check
    for this state because in the refcounted queue model, we can still
    call it after blk_cleanup_queue() has been called.

    This was reported as causing an oops attributable to SCSI.
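
    A sketch of the kind of guard this calls for in __elv_next_request()
    (the exact test used by the patch is assumed):

        /* Sketch: bail out instead of touching the elevator once
         * blk_cleanup_queue() has marked the queue dead. */
        if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
                return NULL;
        /* ... otherwise ask the elevator to dispatch as before ... */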

    Signed-off-by: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    James Bottomley
     

18 May, 2011

2 commits

  • Consider this scenario:
    1. blk_delay_queue(q, SCSI_QUEUE_DELAY);
    2. blk_run_queue_async();
    The second call becomes a noop, because q->delay_work already has
    WORK_STRUCT_PENDING_BIT set, so the delayed work will still only run
    after SCSI_QUEUE_DELAY. But blk_run_queue_async actually expects the
    delayed work to run immediately.

    Fix this by cancelling any potentially pending delayed work before
    queuing an immediate run of the workqueue.
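
    A sketch of the pattern (blk-core.c details assumed; the point is to
    clear the pending delayed instance before requeueing with no delay):

        /* Sketch: an already-pending delayed work would turn the 0-delay
         * queueing into a noop, so cancel it (non-blocking) first. */
        if (likely(!blk_queue_stopped(q))) {
                __cancel_delayed_work(&q->delay_work);
                queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
        }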

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • In some cases we would end up stacking discard_zeroes_data incorrectly.
    Fix this by enabling the feature by default for stacking drivers and
    clearing it for low-level drivers. Incorporating a device that does not
    support dzd will then cause the feature to be disabled in the stacking
    driver.

    Also ensure that the maximum discard value does not overflow when
    exported in sysfs and return 0 in the alignment and dzd fields for
    devices that don't support discard.

    Reported-by: Lukas Czerner
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

16 May, 2011

1 commit

  • Currently we first map the task to its cgroup and then the cgroup to
    the blkio_cgroup. There is a more direct way to get to the blkio_cgroup
    from the task using task_subsys_state(). Use that.

    The real reason for the fix is that it also avoids a race in generic
    cgroup code. During remount/umount rebind_subsystems() is called and it
    can do the following without waiting for an rcu grace period:

    cgrp->subsys[i] = NULL;

    That means that if somebody got hold of the cgroup under rcu and then
    tried to do cgroup->subsys[] to get to the blkio_cgroup, it would get
    NULL, which is wrong. I was running into this race condition with ltp
    running on an upstream-derived kernel and that led to a crash.

    So ideally we should also fix the generic cgroup code to wait for an
    rcu grace period before setting the pointer to NULL. Li Zefan is not
    very keen on introducing synchronize_rcu() there as he thinks it will
    slow down mount/remount/umount operations.

    So for the time being, at least fix the kernel crash by taking a more
    direct route to the blkio_cgroup.

    One tester had reported a crash while running LTP on a derived kernel;
    with this fix the crash is no longer seen and the test has been running
    for over 6 days.
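
    A sketch of the more direct lookup (helper name and the css member are
    assumptions; the caller holds rcu_read_lock()):

        /* Sketch: go straight from the task to its blkio_cgroup instead
         * of via task->cgroups->subsys[]. */
        static inline struct blkio_cgroup *
        task_to_blkio_cgroup(struct task_struct *tsk)
        {
                return container_of(task_subsys_state(tsk, blkio_subsys_id),
                                    struct blkio_cgroup, css);
        }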

    Signed-off-by: Vivek Goyal
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

07 May, 2011

4 commits

  • Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the
    bios fails because the underlying device does not support discard
    requests. However, if the device is, for example, a dm device composed
    of devices of which some support discard and some do not, it is ok for
    some bios to fail with EOPNOTSUPP; it does not mean that discard is not
    supported at all.

    This commit removes the check for bios that failed with EOPNOTSUPP and
    changes blkdev_issue_discard() to return "operation not supported" if
    and only if the device does not actually support it at all, not just
    part of the device as some bios might indicate.

    This change also fixes a problem with the BLKDISCARD ioctl(), which now
    works correctly on such dm devices.

    Signed-off-by: Lukas Czerner
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we
    do not need to check for -EOPNOTSUPP specifically in case of error.
    There is also no need for the submit: label, because there is no way to
    jump out of the while loop without an error and we really want to exit
    rather than try again. Also remove the check for (sz == 0), since at
    that point sz can never be zero.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Jeff Moyer
    CC: Dmitry Monakhov
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • Currently we wait for every submitted REQ_DISCARD bio separately, but
    this can have the unwanted consequence of repeatedly flushing the
    queue, so instead submit the bios in batches and wait for the entire
    batch, hence narrowing the window for other ios to get in.

    Use bio_batch_end_io() and struct bio_batch for that purpose, the same
    as used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so we
    always clear BIO_UPTODATE in the case of an error, and remove the check
    for bb, since we are the only user of this function and we always set
    it.

    Remove bio_get()/bio_put() from blkdev_issue_discard(), since
    bio_alloc() and bio_batch_end_io() are doing the same thing; it is not
    needed anymore.
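
    A sketch of the batching pattern (mirroring what blkdev_issue_zeroout()
    does; field and helper names are assumed):

        struct bio_batch {
                atomic_t                done;
                unsigned long           flags;
                struct completion       *wait;
        };

        static void bio_batch_end_io(struct bio *bio, int err)
        {
                struct bio_batch *bb = bio->bi_private;

                if (err)                /* always flag the error now */
                        clear_bit(BIO_UPTODATE, &bb->flags);
                if (atomic_dec_and_test(&bb->done))
                        complete(bb->wait);
                bio_put(bio);
        }

        /* submit side: start done at 1 so the batch cannot complete
         * before every bio has been issued, then wait for the rest */
        atomic_set(&bb.done, 1);
        /* ... for each chunk: bio->bi_private = &bb,
         *     bio->bi_end_io = bio_batch_end_io,
         *     atomic_inc(&bb.done), submit_bio(type, bio) ... */
        if (!atomic_dec_and_test(&bb.done))
                wait_for_completion(&wait);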

    I have done simple dd testing with surprising results. The script I have
    used is:

    for i in $(seq 10); do
    echo $i
    dd if=/dev/sdb1 of=/dev/sdc1 bs=4k &
    sleep 5
    done
    /usr/bin/time -f %e ./blkdiscard /dev/sdc1

    Running time of BLKDISCARD on the whole device:

        with patch:     0.95 s
        without patch: 15.58 s

    So we can see that in this artificial test the kernel with the patch
    applied is approx 16x faster in discarding the device.

    Signed-off-by: Lukas Czerner
    CC: Dmitry Monakhov
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • In some drives, flush requests are non-queueable. When a flush request
    is running, normal read/write requests can't run. If the block layer
    dispatches such a request, the driver can't handle it and requeues it.
    Tejun suggested we can hold the queue when a flush is running. This
    avoids unnecessary requeues and can also improve performance. For
    example, say we have the requests flush1, write1, flush2. flush1 is
    dispatched, then the queue is held, so write1 isn't inserted into the
    queue. After flush1 is finished, flush2 will be dispatched. Since the
    disk cache is already clean, flush2 will finish very soon, so it looks
    like flush2 is folded into flush1.

    In my test, the queue holding completely solves a regression introduced by
    commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:

    block: make the flush insertion use the tail of the dispatch list

    It's not a preempt type request, in fact we have to insert it
    behind requests that do specify INSERT_FRONT.

    which causes about 20% regression running a sysbench fileio
    workload.

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com