Eric Lee / smarc-fsl-linux-kernel

07 Jan, 2009

1 commit

3ada8b7e9 block: struct device - replace bus_id with dev_name(), dev_set_name()

Cc: Jens Axboe
Signed-off-by: Kay Sievers
Signed-off-by: Greg Kroah-Hartman

Kay Sievers
2009-01-07 02:44:43 +0800

31 Dec, 2008

1 commit

2ca1a6158 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

Conflicts:

arch/x86/kernel/io_apic.c

Rusty Russell
2008-12-31 20:35:57 +0800

30 Dec, 2008

1 commit

33edcf133 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

Rusty Russell
2008-12-30 05:32:35 +0800

29 Dec, 2008

22 commits

62c1fe9d9 cfq-iosched: fix race between exiting queue and exiting task

Original patch from Nikanth Karthikesan

When a queue exits the queue lock is taken and cfq_exit_queue() would free all
the cic's associated with the queue.

But when a task exits, cfq_exit_io_context() gets cic one by one and then
locks the associated queue to call __cfq_exit_single_io_context. It looks like
between getting a cic from the ioc and locking the queue, the queue might have
exited on another cpu.

Fix this by rechecking the cfq_io_context queue key inside the queue lock
again, and not calling into __cfq_exit_single_io_context() if somebody
beat us to it.
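
The recheck described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: every type and function name below is hypothetical, the "lock" is a plain flag standing in for the real queue lock, and the other CPU is simulated by calling the queue exit path between the unlocked read of the key and the task exit path.

```c
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the kernel's types. */
struct queue_sketch {
    int locked;                 /* models the queue lock */
};

struct cic_sketch {
    struct queue_sketch *key;   /* owning queue, NULL once torn down */
    int teardowns;              /* how many times teardown ran */
};

static void queue_lock_sketch(struct queue_sketch *q)   { q->locked = 1; }
static void queue_unlock_sketch(struct queue_sketch *q) { q->locked = 0; }

/* Must be called with the owning queue locked. */
static void exit_single_cic_sketch(struct cic_sketch *cic)
{
    cic->key = NULL;
    cic->teardowns++;
}

/* Queue exit: tear down every cic still attached to q. */
static void queue_exit_sketch(struct queue_sketch *q,
                              struct cic_sketch *cics, size_t n)
{
    size_t i;

    queue_lock_sketch(q);
    for (i = 0; i < n; i++)
        if (cics[i].key == q)
            exit_single_cic_sketch(&cics[i]);
    queue_unlock_sketch(q);
}

/*
 * Task exit: q was read from cic->key earlier, without the lock.
 * The key is checked again under the lock, so a cic that the queue
 * exit path already handled is left alone.
 */
static void task_exit_cic_sketch(struct cic_sketch *cic,
                                 struct queue_sketch *q)
{
    queue_lock_sketch(q);
    if (cic->key == q)
        exit_single_cic_sketch(cic);
    queue_unlock_sketch(q);
}
```

Without the `cic->key == q` test inside the locked region, the teardown in this model would run twice for the same cic.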

Signed-off-by: Jens Axboe

Jens Axboe
2008-12-29 15:29:52 +0800
b3a6ffe16 Get rid of CONFIG_LSF

We have two separate config entries for large devices/files. One
is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
that handles large files. This doesn't make a lot of sense, you typically
want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
to indicate that it covers both.

Acked-by: Jean Delvare
Signed-off-by: Jens Axboe

Jens Axboe
2008-12-29 15:29:51 +0800
3c18ce71a block: make blk_softirq_init() static

Sparse asked whether these could be static.

Signed-off-by: Roel Kluin
Signed-off-by: Jens Axboe

Roel Kluin
2008-12-29 15:29:51 +0800
18af8b2ca block: use min_not_zero in blk_queue_stack_limits

Zero is invalid for max_phys_segments, max_hw_segments, and
max_segment_size. It's better to use min_not_zero instead of
min. min() works though (because commit 0e435ac makes sure that
these values are set to the default values, non zero, if a queue is
initialized properly).

With this patch, blk_queue_stack_limits does almost the same thing
that dm's combine_restrictions_low() does. I think that it's easy to
remove dm's combine_restrictions_low.
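
To illustrate why min_not_zero is the safer choice when zero means "not set", here is a small userspace sketch. It is not the kernel macro: the function and struct names are hypothetical, and the struct only carries the three limits named in the message.

```c
/* Userspace sketch of the min_not_zero idea; not the kernel macro itself. */
static unsigned int min_not_zero_uint(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;           /* a is unset: take b, whatever it is */
    if (b == 0)
        return a;           /* b is unset: take a */
    return a < b ? a : b;   /* both set: ordinary minimum */
}

/* Hypothetical stand-in for the three limits named above. */
struct limits_sketch {
    unsigned int max_phys_segments;
    unsigned int max_hw_segments;
    unsigned int max_segment_size;
};

/* Combine the limits of a stacked (top) queue with its bottom queue. */
static void stack_limits_sketch(struct limits_sketch *top,
                                const struct limits_sketch *bottom)
{
    top->max_phys_segments = min_not_zero_uint(top->max_phys_segments,
                                               bottom->max_phys_segments);
    top->max_hw_segments = min_not_zero_uint(top->max_hw_segments,
                                             bottom->max_hw_segments);
    top->max_segment_size = min_not_zero_uint(top->max_segment_size,
                                              bottom->max_segment_size);
}
```

With a plain minimum, a zero on either side would propagate upward as an invalid limit; here a zero is simply ignored in favour of the other value.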

Signed-off-by: FUJITA Tomonori
Signed-off-by: Jens Axboe

FUJITA Tomonori
2008-12-29 15:29:51 +0800
a6f23657d block: add one-hit cache for disk partition lookup

disk_map_sector_rcu() returns a partition from a sector offset,
which we use for IO statistics on a per-partition basis. The
lookup itself is an O(N) list lookup, where N is the number of
partitions. This actually hurts performance quite a bit, even
on the lower end partitions. On higher numbered partitions,
it can get pretty bad.

Solve this by adding a one-hit cache for partition lookup.
This makes the lookup O(1) for the case where we do most IO to
one partition. Even for mixed partition workloads, amortized cost
is pretty close to O(1) since the natural IO batching makes the
one-hit cache last for lots of IOs.
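
A minimal sketch of the one-hit cache idea follows. It makes several assumptions that are not in the commit: partitions live in an array rather than a list, RCU protection is left out entirely, and all names (`part_sketch`, `disk_sketch`, `map_sector_sketch`) are hypothetical. The `slow_lookups` counter exists only to make the effect of the cache visible.

```c
#include <stddef.h>

struct part_sketch {
    unsigned long long start;   /* first sector of the partition */
    unsigned long long nr;      /* number of sectors */
};

struct disk_sketch {
    struct part_sketch *parts;
    size_t nr_parts;
    struct part_sketch *last_lookup;  /* one-hit cache, NULL if empty */
    unsigned long slow_lookups;       /* counts O(N) scans, for illustration */
};

static int sector_in_part(const struct part_sketch *p,
                          unsigned long long sector)
{
    return sector >= p->start && sector - p->start < p->nr;
}

/* Return the partition containing sector, or NULL if there is none. */
static struct part_sketch *map_sector_sketch(struct disk_sketch *d,
                                             unsigned long long sector)
{
    size_t i;

    /* Fast path: the previous hit still covers this sector. */
    if (d->last_lookup && sector_in_part(d->last_lookup, sector))
        return d->last_lookup;

    /* Slow path: linear scan, then remember the hit. */
    d->slow_lookups++;
    for (i = 0; i < d->nr_parts; i++) {
        if (sector_in_part(&d->parts[i], sector)) {
            d->last_lookup = &d->parts[i];
            return d->last_lookup;
        }
    }
    return NULL;
}
```

Consecutive lookups that land in the same partition take only the fast path, which is the case the message describes as O(1).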

Signed-off-by: Jens Axboe

Jens Axboe
2008-12-29 15:29:51 +0800
30e0dc28b cfq-iosched: remove limit of dispatch depth of max 4 times quantum

This basically limits the hardware queue depth to 4*quantum at any
point in time, which is 16 with the default settings. As CFQ uses
other means to shrink the hardware queue when necessary in the first
place, there's really no need for this extra heuristic. Additionally,
it ends up hurting performance in some cases.

Signed-off-by: Jens Axboe

Jens Axboe
2008-12-29 15:29:51 +0800
b374d18a4 block: get rid of elevator_t typedef

Just use struct elevator_queue everywhere instead.

Signed-off-by: Jens Axboe

Jens Axboe
2008-12-29 15:29:50 +0800
a31a97381 block: don't use plugging on SSD devices

We just want to hand the first bits of IO to the device as fast
as possible. Gains a few percent on the IOPS rate.

Signed-off-by: Jens Axboe

Jens Axboe
2008-12-29 15:28:45 +0800
a185eb4bc block: fix empty barrier on write-through w/ ordered tag

Empty barrier on write-through (or no cache) w/ ordered tag has no
command to execute and without any command to execute ordered tag is
never issued to the device and the ordering is never achieved. Force
draining for such cases.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2008-12-29 15:28:45 +0800
58eea927d block: simplify empty barrier implementation

Empty barrier required special handling in __elv_next_request() to
complete it without letting the low level driver see it.

With previous changes, barrier code is now flexible enough to skip the
BAR step using the same barrier sequence selection mechanism. Drop
the special handling and mask off q->ordered from start_ordered().

Remove blk_empty_barrier() test which now has no user.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2008-12-29 15:28:45 +0800
8f11b3e99 block: make barrier completion more robust

Barrier completion had the following assumptions.

* start_ordered() couldn't finish the whole sequence properly. If all
actions are to be skipped, q->ordseq is set correctly but the actual
completion was never triggered thus hanging the barrier request.

* Drain completion in elv_complete_request() assumed that there's
always at least one request in the queue when drain completes.

Both assumptions are true but these assumptions need to be removed to
improve empty barrier implementation. This patch makes the following
changes.

* Make start_ordered() use blk_ordered_complete_seq() to mark skipped
steps complete and notify __elv_next_request() that it should fetch
the next request if the whole barrier has completed inside
start_ordered().

* Make drain completion path in elv_complete_request() check whether
the queue is empty. Empty queue also indicates drain completion.

* While at it, convert 0/1 return from blk_do_ordered() to false/true.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2008-12-29 15:28:45 +0800
f671620e7 block: make every barrier action optional

In all barrier sequences, the barrier write itself was always assumed
to be issued and thus didn't have corresponding control flag. This
patch adds QUEUE_ORDERED_DO_BAR and unifies action mask handling in
start_ordered() such that any barrier action can be skipped.

This patch doesn't introduce any visible behavior changes.
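
The idea of treating every step, including the barrier write, as a maskable action can be sketched as below. The constant names only mirror the QUEUE_ORDERED_DO_* naming scheme; the numeric values, the `SKETCH_` prefix, and both helper functions are assumptions made for illustration, not the kernel's definitions.

```c
/* Illustrative action bits; values are assumptions, not kernel constants. */
enum {
    SKETCH_DO_PREFLUSH  = 0x10,
    SKETCH_DO_BAR       = 0x20,  /* the barrier write itself */
    SKETCH_DO_POSTFLUSH = 0x40,
    SKETCH_DO_FUA       = 0x80
};

#define SKETCH_DO_MASK (SKETCH_DO_PREFLUSH | SKETCH_DO_BAR | \
                        SKETCH_DO_POSTFLUSH | SKETCH_DO_FUA)

/* Actions left to issue once the caller's skip mask has been applied. */
static unsigned int actions_to_issue_sketch(unsigned int ordered,
                                            unsigned int skip)
{
    return ordered & SKETCH_DO_MASK & ~skip;
}

/* True when every action was skipped and nothing reaches the device. */
static int sequence_is_empty_sketch(unsigned int ordered, unsigned int skip)
{
    return actions_to_issue_sketch(ordered, skip) == 0;
}
```

Because the barrier write has its own bit, skipping it is handled by the same masking step as skipping a flush, with no special case.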

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2008-12-29 15:28:45 +0800
a7384677b block: remove duplicate or unused barrier/discard error paths

* Because barrier mode can be changed dynamically, whether barrier is
supported or not can be determined only when actually issuing the
barrier and there is no point in checking it earlier. Drop barrier
support check in generic_make_request() and __make_request(), and
update comment around the support check in blk_do_ordered().

* There is no reason to check discard support in both
generic_make_request() and __make_request(). Drop the check in
__make_request(). While at it, move error action block to the end
of the function and add unlikely() to q existence test.

* Barrier request, be it empty or not, is never passed to low level
driver and thus it's meaningless to try to copy back req->sector to
bio->bi_sector on error. In addition, the notion of failed sector
doesn't make any sense for empty barrier to begin with. Drop the
code block from __end_that_request_first().

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2008-12-29 15:28:44 +0800
313e42999 block: reorganize QUEUE_ORDERED_* constants

Separate out ordering type (drain,) and action masks (preflush,
postflush, fua) from visible ordering mode selectors
(QUEUE_ORDERED_*). Ordering types are now named QUEUE_ORDERED_BY_*
while action masks are named QUEUE_ORDERED_DO_*.

This change is necessary to add QUEUE_ORDERED_DO_BAR and make it
optional to improve empty barrier implementation.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2008-12-29 15:28:44 +0800
64d01dc9e block: use cancel_work_sync() instead of kblockd_flush_work()

After many improvements on kblockd_flush_work, it is now identical to
cancel_work_sync, so a direct call to cancel_work_sync is suggested.

The only difference is that cancel_work_sync is a GPL symbol,
so non-GPL modules can no longer use it.

Signed-off-by: Cheng Renquan
Cc: Jens Axboe
Signed-off-by: Jens Axboe

Cheng Renquan
2008-12-29 15:28:44 +0800
08bafc034 block: Supress Buffer I/O errors when SCSI REQ_QUIET flag set

Allow the scsi request REQ_QUIET flag to be propagated to the buffer
file system layer. The basic idea is to pass the flag from the scsi
request to the bio (block IO) and then to the buffer layer. The buffer
layer can then suppress needless printks.

This patch declutters the kernel log by removing the 40-50 (per lun)
buffer io error messages seen during a boot in my multipath setup. There
is a good chance any real errors will be missed in the "noise" in the
logs without this patch.

During boot I see blocks of messages like
"
__ratelimit: 211 callbacks suppressed
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242847
Buffer I/O error on device sdm, logical block 1
Buffer I/O error on device sdm, logical block 5242878
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242872
"
in my logs.

My disk environment is multipath fiber channel using the SCSI_DH_RDAC
code and multipathd. This topology includes an "active" and "ghost"
path for each lun. IOs to the "ghost" path will never complete, and the
SCSI layer, via the scsi device handler rdac code, quickly returns the IOs
to these paths and sets the REQ_QUIET scsi flag to suppress the scsi
layer messages.

I want to extend the QUIET behavior to include the buffer file
system layer to deal with these errors as well. I have been running this
patch for a while now on several boxes without issue. A few runs of
bonnie++ show no noticeable difference in performance in my setup.

Thanks to John Stultz for the quiet_error finalization.

Submitted-by: Keith Mannthey
Signed-off-by: Jens Axboe

Keith Mannthey
2008-12-29 15:28:44 +0800
7c239517d block: don't take lock on changing ra_pages

There's no need to take queue_lock or kernel_lock when modifying
bdi->ra_pages. So remove them. Also remove out of date comment for
queue_max_sectors_store().

Signed-off-by: Wu Fengguang
Signed-off-by: Jens Axboe

Wu Fengguang
2008-12-29 15:28:43 +0800
c6a06f707 block/blk-tag.c: cleanup kernel-doc

There is no argument named @tags in blk_init_tags,
so remove its comment.

Signed-off-by: Qinghuang Feng
Signed-off-by: Jens Axboe

Qinghuang Feng
2008-12-29 15:28:43 +0800
2b91bafcc scsi-ioctl: use clock_t <> jiffies

Convert the timeout ioctl scaling to use the clock_t functions
which are much more accurate with some USER_HZ vs HZ combinations.
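
One way such an inaccuracy can arise is integer division applied too early. The sketch below compares two hypothetical conversions from user-visible clock ticks to jiffies; the USER_HZ value of 100, the function names, and the assumption that a "naive" conversion divides HZ by USER_HZ first are all illustrative and not taken from the patch.

```c
/* Assumed user-visible tick rate for this sketch. */
#define USER_HZ_SKETCH 100ULL

/* Divides first: exact only when hz is a multiple of USER_HZ_SKETCH. */
static unsigned long long ticks_to_jiffies_naive(unsigned long long ticks,
                                                 unsigned long long hz)
{
    return ticks * (hz / USER_HZ_SKETCH);
}

/* Multiplies first: keeps the fractional part of hz / USER_HZ_SKETCH. */
static unsigned long long ticks_to_jiffies_scaled(unsigned long long ticks,
                                                  unsigned long long hz)
{
    return ticks * hz / USER_HZ_SKETCH;
}
```

With hz = 250, one second (100 ticks) becomes 200 jiffies in the naive form but 250 in the scaled form; with hz = 1000 the two agree.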

Signed-off-by: Milton Miller
Signed-off-by: Jens Axboe

Milton Miller
2008-12-29 15:28:42 +0800
70ed28b92 block: leave the request timeout timer running even on an empty list

For sync IO, we'll often do them serialized. This means we'll be touching
the queue timer for every IO, as opposed to only occasionally like we
do for queued IO. Instead of deleting the timer when the last request
is removed, just let it continue running. If a new request comes up soon
we then don't have to re-add the timer. If no new requests arrive,
the timer will expire without side effect later.

This improves high iops sync IO by ~1%.

Signed-off-by: Jens Axboe

Jens Axboe
2008-12-29 15:28:42 +0800
65d3618cc block: add comment in blk_rq_timed_out() about why next can not be 0

Signed-off-by: Jens Axboe

Jens Axboe
2008-12-29 15:28:42 +0800
565e411d7 block: optimizations in blk_rq_timed_out_timer()

Now the rq->deadline can't be zero if the request is in the
timeout_list, so there is no need to have next_set. There is no need to
access a request's deadline field if blk_rq_timed_out is called on it.

Signed-off-by: Malahal Naineni
Signed-off-by: Jens Axboe

malahal@us.ibm.com
2008-12-29 15:28:42 +0800

26 Dec, 2008

1 commit

be4d638c1 cpumask: Replace cpu_coregroup_map with cpu_coregroup_mask

cpu_coregroup_map returned a cpumask_t: it's going away.

(Note, the sched part of this patch won't apply meaningfully to the
sched tree, but I'm posting it to show the goal).

Signed-off-by: Rusty Russell
Signed-off-by: Mike Travis
Cc: Jens Axboe
Cc: Ingo Molnar

Rusty Russell
2008-12-26 19:53:43 +0800

19 Dec, 2008

1 commit

30cd324e9 Merge branches 'tracing/ftrace', 'tracing/ring-buffer' and 'tracing/urgent' into tracing/core

Conflicts:
include/linux/ftrace.h

Ingo Molnar
2008-12-19 16:42:40 +0800

06 Dec, 2008

1 commit

f2f1fa78a Enforce a minimum SG_IO timeout

There's no point in having too short SG_IO timeouts, since if the
command does end up timing out, we'll end up through the reset sequence
that is several seconds long in order to abort the command that timed
out.

As a result, shorter timeouts than a few seconds simply do not make
sense, as the recovery would be longer than the timeout itself.

Add a BLK_MIN_SG_TIMEOUT to match the existing BLK_DEFAULT_SG_TIMEOUT.
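
The effect of a minimum timeout can be sketched as a simple clamp. The tick rate and the two limits below (7 and 60 seconds) are assumptions chosen for illustration, since the message only says "a few seconds"; the function name is hypothetical as well.

```c
/* Assumed values for this sketch only. */
#define HZ_SKETCH                 250UL
#define MIN_SG_TIMEOUT_SKETCH     (7UL * HZ_SKETCH)
#define DEFAULT_SG_TIMEOUT_SKETCH (60UL * HZ_SKETCH)

/* Pick the timeout (in jiffies) that a request would actually get. */
static unsigned long sg_timeout_sketch(unsigned long requested)
{
    if (requested == 0)
        return DEFAULT_SG_TIMEOUT_SKETCH;  /* caller gave no timeout */
    if (requested < MIN_SG_TIMEOUT_SKETCH)
        return MIN_SG_TIMEOUT_SKETCH;      /* too short to be useful */
    return requested;
}
```

A caller asking for a one-jiffy timeout gets the minimum instead, so the timeout is never shorter than the recovery it would trigger.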

Suggested-by: Alan Cox
Acked-by: Tejun Heo
Acked-by: Jens Axboe
Cc: Jeff Garzik
Signed-off-by: Linus Torvalds

Linus Torvalds
2008-12-06 06:49:18 +0800

05 Dec, 2008

1 commit

970987beb Merge branches 'tracing/ftrace', 'tracing/function-graph-tracer' and 'tracing/urgent' into tracing/core

Ingo Molnar
2008-12-05 21:45:22 +0800

04 Dec, 2008

2 commits

fd4ce1acd [PATCH 1/2] kill FMODE_NDELAY_NOW

Update FMODE_NDELAY before each ioctl call so that we can kill the
magic FMODE_NDELAY_NOW. It would be even better to do this directly
in setfl(), but for that we'd need to have FMODE_NDELAY for all files,
not just block special files.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2008-12-04 17:22:57 +0800
1c925604e [PATCH] Fix block dev compat ioctl handling

Commit 33c2dca4957bd0da3e1af7b96d0758d97e708ef6 (trim file propagation
in block/compat_ioctl.c) removed the handling of some ioctls from
compat_blkdev_driver_ioctl. That caused them to be rejected as unknown
by the compat layer.

Signed-off-by: Andreas Schwab
Cc: Al Viro
Signed-off-by: Al Viro

Andreas Schwab
2008-12-04 17:22:55 +0800

03 Dec, 2008

4 commits

0e435ac26 block: fix setting of max_segment_size and seg_boundary mask

Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
devices.

When stacking devices (LVM over MD over SCSI) some of the request queue
parameters are not set up correctly in some cases by default, namely
max_segment_size and seg_boundary mask.

If you create an MD device over SCSI, these attributes are zeroed.

The problem appears when another device-mapper mapping is stacked over
this mapping. The queue attributes are set in DM this way:

request_queue   max_segment_size   seg_boundary_mask
SCSI            65536              0xffffffff
MD RAID1        0                  0
LVM             65536              -1 (64bit)

Unfortunately bio_add_page (resp. bio_phys_segments) calculates the number of
physical segments according to these parameters.

During generic_make_request() the segment count is recalculated and can
increase the bio->bi_phys_segments count over the allowed limit. (After
bio_clone() in stack operation.)

This is especially a problem in the CCISS driver, where it produces an OOPS here

BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

(MAXSGENTRIES is 31 by default.)

Sometimes even this command is enough to cause oops:

dd iflag=direct if=/dev// of=/dev/null bs=128000 count=10

This command generates bios with 250 sectors, allocated in 32 4k-pages
(last page uses only 1024 bytes).
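
The figures in the previous sentence follow from simple arithmetic, sketched here. The sketch assumes 512-byte sectors, 4096-byte pages, and a page-aligned buffer; the helper names are hypothetical.

```c
#define SECTOR_SIZE_SKETCH 512UL
#define PAGE_SIZE_SKETCH   4096UL

/* Number of whole sectors covered by a transfer of the given size. */
static unsigned long sectors_for(unsigned long bytes)
{
    return bytes / SECTOR_SIZE_SKETCH;
}

/* Number of pages touched, assuming the buffer starts on a page boundary. */
static unsigned long pages_for(unsigned long bytes)
{
    return (bytes + PAGE_SIZE_SKETCH - 1) / PAGE_SIZE_SKETCH;
}

/* Bytes used in the final page of the transfer. */
static unsigned long last_page_bytes(unsigned long bytes)
{
    unsigned long rem = bytes % PAGE_SIZE_SKETCH;

    return rem ? rem : PAGE_SIZE_SKETCH;
}
```

For bs=128000 this gives 250 sectors, 32 pages, and 1024 bytes in the last page, one page more than the limit of 31 quoted above.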

For the LVM layer, it allocates a bio with 31 segments (still OK for CCISS);
unfortunately on the lower layer it is recalculated to 32 segments, and this
violates the CCISS restriction and triggers BUG_ON().

The patch tries to fix it by:

* initializing attributes above in queue request constructor
blk_queue_make_request()

* make sure that blk_queue_stack_limits() inherits setting

(DM uses its own function to set the limits because
blk_queue_stack_limits() was introduced later. It should probably switch
to use generic stack limit function too.)

* sets the default seg_boundary value in one place (blkdev.h)

* use this mask as default in DM (instead of -1, which differs in 64bit)

Bugs related to this:
https://bugzilla.redhat.com/show_bug.cgi?id=471639
http://bugzilla.kernel.org/show_bug.cgi?id=8672

Signed-off-by: Milan Broz
Reviewed-by: Alasdair G Kergon
Cc: Neil Brown
Cc: FUJITA Tomonori
Cc: Tejun Heo
Cc: Mike Miller
Signed-off-by: Jens Axboe

Milan Broz
2008-12-03 19:55:55 +0800
53a08807c block: internal dequeue shouldn't start timer

blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
both start the timeout timer. Barrier code dequeues the original
barrier request but doesn't pass the request itself to the lower level
driver, only broken down proxy requests; however, the original
barrier request goes through the same dequeue path, so the timeout timer is
started on it. If the barrier sequence takes long enough, this timer
expires but the low level driver has no idea about this request and
an oops follows.

Timeout timer shouldn't have been started on the original barrier
request as it never goes through actual IO. This patch unexports
elv_dequeue_request(), which has no external user anyway, and makes it
operate on elevator proper w/o adding the timer and make
blkdev_dequeue_request() call elv_dequeue_request() and add timer.
Internal users which don't pass the request to driver - barrier code
and end_that_request_last() - are converted to use
elv_dequeue_request().

Signed-off-by: Tejun Heo
Cc: Mike Anderson
Signed-off-by: Jens Axboe

Tejun Heo
2008-12-03 19:41:26 +0800
bf91db18a block: set disk->node_id before it's being used

disk->node_id is referenced when allocating in disk_expand_part_tbl, so we
should set it before it is used.

Signed-off-by: Cheng Renquan
Signed-off-by: Jens Axboe

Cheng Renquan
2008-12-03 19:41:20 +0800
53cc0b294 When block layer fails to map iov, it calls bio_unmap_user to undo

mapping. This is fine if the pages were mapped, but if they were provided
by someone else and merely copied, then bad things happen: the pages are
released once here and once by the caller, leading to a user-triggerable BUG
at include/linux/mm.h:246.

Signed-off-by: Petr Vandrovec
Signed-off-by: Jens Axboe

Petr Vandrovec
2008-12-03 19:41:20 +0800

26 Nov, 2008

2 commits

0bfc24559 blktrace: port to tracepoints, update

Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
sites. Spread them out to the usage sites, as suggested by
Mathieu Desnoyers.

Signed-off-by: Ingo Molnar
Acked-by: Mathieu Desnoyers

Ingo Molnar
2008-11-26 20:04:35 +0800
5f3ea37c7 blktrace: port to tracepoints

This was a forward port of work done by Mathieu Desnoyers; I changed it to
encode the 'what' parameter on the tracepoint name, so that one can register
interest in specific events and not on classes of events to then check the
'what' parameter.

Signed-off-by: Arnaldo Carvalho de Melo
Signed-off-by: Jens Axboe
Signed-off-by: Ingo Molnar

Arnaldo Carvalho de Melo
2008-11-26 19:13:34 +0800

18 Nov, 2008

3 commits

c26156b25 block: hold extra reference to bio in blk_rq_map_user_iov()

If the size passed in is OK but we end up mapping too many segments,
we call the unmap path directly like from IO completion. But from IO
completion we have an extra reference to the bio, so this error case
goes OOPS when it attempts to free an already freed bio.

Fix it by getting an extra reference to the bio before calling the
unmap failure case.
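
A toy reference-count model shows why the extra reference matters. It assumes, for illustration only, that the unmap path drops two references because it was written for IO completion, where one more reference is held. All names are hypothetical, and the counters stand in for real allocation and freeing.

```c
/* Toy bio with a reference count; frees and misuse are only counted. */
struct bio_sketch {
    int refcount;
    int frees;            /* times the count reached zero */
    int puts_after_free;  /* puts on an already freed bio (the bug) */
};

static void bio_get_sketch(struct bio_sketch *b)
{
    b->refcount++;
}

static void bio_put_sketch(struct bio_sketch *b)
{
    if (b->refcount <= 0) {
        b->puts_after_free++;
        return;
    }
    if (--b->refcount == 0)
        b->frees++;
}

/* Assumed to drop two references, as it would after IO completion. */
static void unmap_as_on_completion_sketch(struct bio_sketch *b)
{
    bio_put_sketch(b);
    bio_put_sketch(b);
}

/* Error path: optionally take the extra reference before unmapping. */
static void fail_mapping_sketch(struct bio_sketch *b, int take_extra_ref)
{
    if (take_extra_ref)
        bio_get_sketch(b);
    unmap_as_on_completion_sketch(b);
}
```

Starting from a single reference, the path with the extra get frees the bio exactly once, while the path without it performs a put on a bio that is already gone.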

Reported-by: Petr Vandrovec

Signed-off-by: Jens Axboe

Jens Axboe
2008-11-18 22:08:56 +0800
561ec68e4 block: fix boot failure with CONFIG_DEBUG_BLOCK_EXT_DEVT=y and nash

We ran into a system boot failure with kernel 2.6.28-rc. We found it on a
couple of machines, including a T61 notebook, a nehalem machine, and another
HPC NX6325 notebook. All the machines use FedoraCore 8 or FedoraCore 9.
With kernels prior to 2.6.28-rc, system boot doesn't fail.

I debugged it and located the root cause. Please see
http://bugzilla.kernel.org/show_bug.cgi?id=11899
https://bugzilla.redhat.com/show_bug.cgi?id=471517

As a matter of fact, there are 2 bugs.

1) root=/dev/sda1: system boot randomly fails, roughly once in every 5
boots. nash has a bug. Some of its functions misuse return
value 0. Sometimes, 0 means timeout and no uevent available. Sometimes,
0 means nash got a uevent, but the uevent isn't block-related (for
example, usb). If, by coincidence, the kernel tells nash that uevents are
available but also sets the timeout, nash might stop collecting
other uevents in the queue if the current uevent isn't block-related. I worked
out a patch for nash to fix it.
http://bugzilla.kernel.org/attachment.cgi?id=18858

2) root=LABEL=/: the system never boots. initrd init reports that
switchroot fails. Here is an execution path of nash when booting:
(1) nash reads /sys/block/sda/dev; assume major is 8 (on my desktop);
(2) nash queries /proc/devices with the major number; it finds the line
"8 sd";
(3) nash uses 'sd' to search its own probe table to find the device (DISK)
type for the device and adds it to its own list;
(4) Later on, it probes all devices in its list to get filesystem
labels; scsi always registers "8 sd".

When major is 259, nash fails to find the device (DISK) type. I enabled
CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling the kernel, so 259 is picked
for device /dev/sda1, which causes nash to fail to find the device (DISK)
type.

To fix issue 2), I created a patch for nash and another patch for the
kernel.

http://bugzilla.kernel.org/attachment.cgi?id=18859
http://bugzilla.kernel.org/attachment.cgi?id=18837

Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new
block device in /proc/devices.

With 2 patches on nash and 1 patch on the kernel, I booted my machines
dozens of times without failure.

Signed-off-by: Zhang Yanmin
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

Zhang, Yanmin
2008-11-18 22:08:56 +0800
ba32929a9 block: make add_partition() return pointer to hd_struct

Make add_partition() return pointer to the new hd_struct on success
and ERR_PTR() value on failure. This change will be used to fix md
autodetection bug.
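
The "valid pointer or encoded error" convention can be imitated in userspace as follows. This is a sketch, not the kernel's ERR_PTR implementation: the helpers, the 4095 cut-off, the struct, and the validation rule inside `add_partition_sketch` are all assumptions for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed size of the error range at the top of the address space. */
#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value as a pointer. */
static void *err_ptr_sketch(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static long ptr_err_sketch(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if ptr falls in the range reserved for encoded errors. */
static int is_err_sketch(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

/* Hypothetical stand-in for the partition structure. */
struct hd_struct_sketch {
    int partno;
};

/* Return the new partition on success, an encoded errno on failure. */
static struct hd_struct_sketch *add_partition_sketch(int partno)
{
    struct hd_struct_sketch *p;

    if (partno < 1)
        return err_ptr_sketch(-EINVAL);
    p = malloc(sizeof(*p));
    if (!p)
        return err_ptr_sketch(-ENOMEM);
    p->partno = partno;
    return p;
}
```

A caller checks the result with `is_err_sketch()` before using it, which lets one return value carry both the new object and the failure reason.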

Signed-off-by: Tejun Heo
Cc: Neil Brown
Signed-off-by: Jens Axboe

Tejun Heo
2008-11-18 22:08:56 +0800