14 Dec, 2011

3 commits

  • With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
    blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
    with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
    and q, and freeing automatically handled by blk-ioc. The elevator
    operation only needs to perform the exit operation specific to the elevator
    - in cfq's case, exiting the cfqq's.

    Also, clearing of io_cq's on q detach is moved to block core and
    automatically performed on elevator switch and q release.

    Because the q an io_cq points to might be freed before the RCU
    callback for the io_cq runs, the blk-ioc code must remember which
    cache the io_cq needs to be freed to when the io_cq is released. A
    new field, io_cq->__rcu_icq_cache, is added for this purpose. As both the new
    field and rcu_head are used only after io_cq is released and the
    q/ioc_node fields aren't, they are put into unions.
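
    The resulting io_cq layout looks roughly like the sketch below (field
    names follow the description above; the exact struct in the tree may
    differ slightly):

        struct io_cq {
                struct request_queue    *q;
                struct io_context       *ioc;

                /* q_node/__rcu_icq_cache and ioc_node/__rcu_head are
                 * never needed at the same time, hence the unions. */
                union {
                        struct list_head        q_node;
                        struct kmem_cache       *__rcu_icq_cache;
                };
                union {
                        struct hlist_node       ioc_node;
                        struct rcu_head         __rcu_head;
                };
        };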

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfq allocates per-queue id using ida and uses it to index cic radix
    tree from io_context. Move it to q->id and allocate on queue init and
    free on queue release. This simplifies cfq a bit and will allow for
    further improvements of io context life-cycle management.
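
    A minimal sketch of that pattern, assuming the usual ida helpers
    (ida_simple_get()/ida_simple_remove()) back q->id:

        /* on queue init */
        q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);
        if (q->id < 0)
                goto fail;

        /* on queue release */
        ida_simple_remove(&blk_queue_ida, q->id);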

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
    macro and use it.
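
    The macro is a one-liner along these lines (following the standard
    queue_flags test pattern):

        #define blk_queue_dead(q)  test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)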

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 Oct, 2011

2 commits

  • request_queue is refcounted but actually depends on lifetime
    management from the queue owner - on blk_cleanup_queue(), the block
    layer expects that no request is passing through the request_queue
    and that no new one will be issued.

    This is fundamentally broken. The queue owner (e.g. the SCSI layer)
    doesn't have a way to know whether there are other active users
    before calling blk_cleanup_queue(), and other users (e.g. bsg) have
    no guarantee that the queue is, and will stay, valid while they hold
    a reference.

    With a delay added in blk_queue_bio() before queue_lock is grabbed,
    the following oops can easily be triggered when a device is removed
    with in-flight IOs:

    sd 0:0:1:0: [sdb] Stopping disk
    ata1.01: disabled
    general protection fault: 0000 [#1] PREEMPT SMP
    CPU 2
    Modules linked in:

    Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
    RIP: 0010:[] [] elv_rqhash_find+0x61/0x100
    ...
    Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
    ...
    Call Trace:
    [] elv_merge+0x84/0xe0
    [] blk_queue_bio+0xf4/0x400
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x74/0x100
    [] dio_bio_submit+0xbc/0xc0
    [] __blockdev_direct_IO+0x92e/0xb40
    [] blkdev_direct_IO+0x57/0x60
    [] generic_file_aio_read+0x6d5/0x760
    [] do_sync_read+0xda/0x120
    [] vfs_read+0xc5/0x180
    [] sys_pread64+0x9a/0xb0
    [] system_call_fastpath+0x16/0x1b

    This happens because blk_cleanup_queue() destroys the queue and
    elevator whether IOs are in progress or not, and DEAD tests are
    sprinkled in the request processing path without proper
    synchronization.

    A similar problem exists for blk-throtl. On queue cleanup, blk-throtl
    is shut down whether it has requests in it or not. Depending on
    timing, it either oopses or throttled bios are lost, putting tasks
    which are waiting for bio completion into eternal D state.

    The way it should work is having the usual clear distinction between
    shutdown and release. Shutdown drains all currently pending requests,
    marks the queue dead, and performs partial teardown of the now
    unnecessary parts of the queue. Even after shutdown is complete,
    reference holders are still allowed to issue requests to the queue,
    although they will be immediately failed. The rest of the teardown
    happens on release.

    This patch makes the following changes so that blk_cleanup_queue()
    behaves as a proper shutdown:

    * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
    queue_lock.

    * Unsynchronized DEAD check in generic_make_request_checks() removed.
    This couldn't make any meaningful difference as the queue could die
    after the check.

    * blk_drain_queue() updated such that it can drain all requests and is
    now called during cleanup.

    * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
    drains all throttled bios during cleanup and frees td when the queue
    is released.
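
    Put together, the cleanup path now behaves roughly like the sketch
    below (heavily simplified; the names follow the list above and the
    locking details are elided - a sketch, not the exact patch):

        void blk_cleanup_queue(struct request_queue *q)
        {
                mutex_lock(&q->exit_mutex);

                spin_lock_irq(q->queue_lock);
                queue_flag_set(QUEUE_FLAG_DEAD, q);     /* new requests now fail */
                spin_unlock_irq(q->queue_lock);

                blk_drain_queue(q);     /* wait out everything in flight */
                mutex_unlock(&q->exit_mutex);

                blk_put_queue(q);       /* remaining teardown happens on release */
        }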

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Conflicts:
    block/blk-core.c
    include/linux/blkdev.h

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Sep, 2011

1 commit

  • A kernel crash is observed when a mounted ext3/ext4 filesystem is
    physically removed. The problem is that blk_cleanup_queue() frees up
    some resources (e.g. by calling elevator_exit()) whose absence is not
    checked for in normal operation. So we should rather move these calls
    to the destructor function blk_release_queue(), as at that point all
    remaining references are gone. However, in doing so we have to ensure
    that any externally supplied queue_lock is disconnected, as the driver
    might free up the lock after the call to blk_cleanup_queue().

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

21 Sep, 2011

1 commit


24 Aug, 2011

1 commit


24 Jul, 2011

1 commit

  • Some systems benefit from completions always being steered to the strict
    requester cpu rather than the looser "per-socket" steering that
    blk_cpu_to_group() attempts by default. This is because the first
    CPU in the group mask ends up being completely overloaded with work,
    while the others (including the original submitter) have power left
    to spare.

    Allow the strict mode to be set by writing '2' to the sysfs control
    file. This is identical to the scheme used for the nomerges file,
    where '2' is a more aggressive setting than just being turned on.

    echo 2 > /sys/block/<dev>/queue/rq_affinity

    Cc: Christoph Hellwig
    Cc: Roland Dreier
    Tested-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

21 May, 2011

1 commit

  • Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
    had churn in the core area that makes it difficult to handle
    patches for e.g. cfq or blk-throttle. Instead of requiring that they
    be based on older versions with bugs that have been fixed later
    in the rc cycle, merge in 2.6.39 final.

    Also fixes up conflicts in the below files.

    Conflicts:
    drivers/block/paride/pcd.c
    drivers/cdrom/viocd.c
    drivers/ide/ide-cd.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 May, 2011

1 commit

  • In some cases we would end up stacking discard_zeroes_data incorrectly.
    Fix this by enabling the feature by default for stacking drivers and
    clearing it for low-level drivers. Incorporating a device that does not
    support dzd will then cause the feature to be disabled in the stacking
    driver.

    Also ensure that the maximum discard value does not overflow when
    exported in sysfs and return 0 in the alignment and dzd fields for
    devices that don't support discard.

    Reported-by: Lukas Czerner
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

19 Apr, 2011

2 commits

  • In queue_requests_store(), the code looks like:

        if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
                blk_set_queue_full(q, BLK_RW_SYNC);
        } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
                blk_clear_queue_full(q, BLK_RW_SYNC);
                wake_up(&rl->wait[BLK_RW_SYNC]);
        }

    If the "if" condition is not satisfied, we know that
    rl->count[BLK_RW_SYNC] < q->nr_requests, which is the same as
    rl->count[BLK_RW_SYNC]+1 <= q->nr_requests. So everything that falls
    through to the "else" already satisfies the "else if" check, and
    that check isn't actually needed.

    Signed-off-by: Tao Ma
    Signed-off-by: Jens Axboe

    Tao Ma
     
  • We do not call blk_trace_remove_sysfs() in the error return path
    when kobject_add() fails. This patch fixes it.

    Cc: stable@kernel.org
    Signed-off-by: Liu Yuan
    Signed-off-by: Jens Axboe

    Liu Yuan
     

14 Apr, 2011

1 commit


03 Mar, 2011

1 commit

  • Move blk_throtl_exit() into blk_cleanup_queue(), as blk_throtl_exit()
    is written in such a way that it needs the queue lock. In
    blk_release_queue() there is no guarantee that ->queue_lock is still
    around.

    Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
    one problem.

    https://lkml.org/lkml/2010/10/23/86

    And a quick fix moved blk_throtl_exit() to blk_release_queue().

    commit 7ad58c028652753814054f4e3ac58f925e7343f4
    Author: Jens Axboe
    Date: Sat Oct 23 20:40:26 2010 +0200

    block: fix use-after-free bug in blk throttle code

    This patch reverts the above change and does not try to shut down the
    throtl work in blk_sync_queue(). By avoiding the call to
    throtl_shutdown_timer_wq() from blk_sync_queue(), we should also
    avoid the problem reported by Ingo.

    blk_sync_queue() seems to be used only by the md driver, which uses
    it to make sure q->unplug_fn is not called while md is about to free
    up the data structures used by unplug_fn() (md registers its own
    unplug functions). Block throttle does not call back into unplug_fn()
    or into md, so there is no need to cancel blk throttle work.

    In fact I think cancelling block throttle work is bad, because it
    might happen that some bios are throttled and scheduled to be
    dispatched later with the help of the pending work; if that work is
    cancelled, these bios might never be dispatched.

    The block layer also uses blk_sync_queue() at blk_cleanup_queue() and
    blk_release_queue() time. That should be safe, as we are also calling
    blk_throtl_exit(), which should make sure all the throttling related
    data structures are cleaned up.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

17 Dec, 2010

1 commit

  • When stacking devices, a request_queue is not always available. This
    forced us to have a no_cluster flag in the queue_limits that could be
    used as a carrier until the request_queue had been set up for a
    metadevice.

    There were several problems with that approach. First of all, it was
    up to the stacking device to remember to set the queue flag after
    stacking had completed. Also, the queue flag and the queue limits had to be kept in
    sync at all times. We got that wrong, which could lead to us issuing
    commands that went beyond the max scatterlist limit set by the driver.

    The proper fix is to avoid having two flags for tracking the same thing.
    We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
    block layer merging functions. The queue_limit 'no_cluster' is turned
    into 'cluster' to avoid double negatives and to ease stacking.
    Clustering defaults to being enabled as before. The queue flag logic is
    removed from the stacking function, and explicitly setting the cluster
    flag is no longer necessary in DM and MD.

    Reported-by: Ed Lin
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

24 Oct, 2010

1 commit

  • blk_throtl_exit() frees the throttle data hanging off the queue
    in blk_cleanup_queue(), but blk_put_queue() will indirectly
    dereference this data when calling blk_sync_queue(), which in
    turn calls throtl_shutdown_timer_wq().

    Fix this by moving the freeing of the throttle data to when
    the queue is truly being released, and post the call to
    blk_sync_queue().

    Reported-by: Ingo Molnar
    Tested-by: Ingo Molnar
    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Oct, 2010

1 commit

  • * 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
    cfq-iosched: Fix a gcc 4.5 warning and put some comments
    block: Turn bvec_k{un,}map_irq() into static inline functions
    block: fix accounting bug on cross partition merges
    block: Make the integrity mapped property a bio flag
    block: Fix double free in blk_integrity_unregister
    block: Ensure physical block size is unsigned int
    blkio-throttle: Fix possible multiplication overflow in iops calculations
    blkio-throttle: limit max iops value to UINT_MAX
    blkio-throttle: There is no need to convert jiffies to milli seconds
    blkio-throttle: Fix link failure failure on i386
    blkio: Recalculate the throttled bio dispatch time upon throttle limit change
    blkio: Add root group to td->tg_list
    blkio: deletion of a cgroup was causes oops
    blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
    block: set the bounce_pfn to the actual DMA limit rather than to max memory
    block: revert bad fix for memory hotplug causing bounces
    Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
    block: set the bounce_pfn to the actual DMA limit rather than to max memory
    block: Prevent hang_check firing during long I/O
    cfq: improve fsync performance for small files
    ...

    Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h

    Linus Torvalds
     

11 Sep, 2010

1 commit

  • Some controllers have a hardware limit on the number of protection
    information scatter-gather list segments they can handle.

    Introduce a max_integrity_segments limit in the block layer and provide
    a new scsi_host_template setting that allows HBA drivers to provide a
    value suitable for the hardware.

    Add support for honoring the integrity segment limit when merging both
    bios and requests.
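
    Roughly, an HBA driver advertises its limit and the block layer
    setter applies it when the device queue is set up - a sketch, with
    blk_queue_max_integrity_segments() as the new setter and
    sg_prot_tablesize standing in for the host template field:

        blk_queue_max_integrity_segments(sdev->request_queue,
                                         shost->sg_prot_tablesize);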

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

23 Aug, 2010

1 commit


08 Aug, 2010

2 commits


10 Apr, 2010

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (34 commits)
    cfq-iosched: Fix the incorrect timeslice accounting with forced_dispatch
    loop: Update mtime when writing using aops
    block: expose the statistics in blkio.time and blkio.sectors for the root cgroup
    backing-dev: Handle class_create() failure
    Block: Fix block/elevator.c elevator_get() off-by-one error
    drbd: lc_element_by_index() never returns NULL
    cciss: unlock on error path
    cfq-iosched: Do not merge queues of BE and IDLE classes
    cfq-iosched: Add additional blktrace log messages in CFQ for easier debugging
    i2o: Remove the dangerous kobj_to_i2o_device macro
    block: remove 16 bytes of padding from struct request on 64bits
    cfq-iosched: fix a kbuild regression
    block: make CONFIG_BLK_CGROUP visible
    Remove GENHD_FL_DRIVERFS
    block: Export max number of segments and max segment size in sysfs
    block: Finalize conversion of block limits functions
    block: Fix overrun in lcm() and move it to lib
    vfs: improve writeback_inodes_wb()
    paride: fix off-by-one test
    drbd: fix al-to-on-disk-bitmap for 4k logical_block_size
    ...

    Linus Torvalds
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, while for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files, but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored, as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on the arch to make
    things build (like ipr on powerpc/64, which failed due to missing
    writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that they could be applied
    as a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from the tests in
    step 6, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one of
    the arch headers, which should be easily discoverable on most builds
    of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

19 Mar, 2010

1 commit


15 Mar, 2010

1 commit


08 Mar, 2010

1 commit

  • Constify struct sysfs_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker moves const data into .rodata
    and therefore excludes it from false sharing

    Signed-off-by: Emese Revfy
    Acked-by: David Teigland
    Acked-by: Matt Domsch
    Acked-by: Maciej Sosnowski
    Acked-by: Hans J. Koch
    Acked-by: Pekka Enberg
    Acked-by: Jens Axboe
    Acked-by: Stephen Hemminger
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     

29 Jan, 2010

1 commit

  • Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
    merges at all are to be attempted (not even the simple one-hit cache).

    The following table illustrates the additional benefit - 5 minute runs of
    a random I/O load were applied to a dozen devices on a 16-way x86_64 system.

    nomerges    Throughput      %System        Improvement (tput / %sys)
    --------    ------------    -----------    --------------------------
    0           12.45 MB/sec    0.669365609
    1           12.50 MB/sec    0.641519199    0.40% / 2.71%
    2           12.52 MB/sec    0.639849750    0.56% / 2.96%
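
    As with other queue tunables, the mode is selected through sysfs
    (device name is a placeholder):

    echo 2 > /sys/block/<dev>/queue/nomerges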

    Signed-off-by: Alan D. Brunelle
    Signed-off-by: Jens Axboe

    Alan D. Brunelle
     

03 Dec, 2009

1 commit

  • The discard ioctl is used by mkfs utilities to clear a block device
    prior to putting metadata down. However, not all devices return zeroed
    blocks after a discard. Some drives return stale data, potentially
    containing old superblocks. It is therefore important to know whether
    discarded blocks are properly zeroed.

    Both ATA and SCSI drives have configuration bits that indicate whether
    zeroes are returned after a discard operation. Implement a block level
    interface that allows this information to be bubbled up the stack and
    queried via a new block device ioctl.
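
    A minimal userspace sketch of querying the flag (the ioctl this
    change adds is exposed as BLKDISCARDZEROES in <linux/fs.h>):

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>           /* BLKDISCARDZEROES */

        int main(int argc, char **argv)
        {
                unsigned int zeroes = 0;
                int fd;

                if (argc != 2)
                        return 1;
                fd = open(argv[1], O_RDONLY);
                if (fd < 0 || ioctl(fd, BLKDISCARDZEROES, &zeroes) < 0) {
                        perror(argv[1]);
                        return 1;
                }
                printf("%s: discard zeroes data: %u\n", argv[1], zeroes);
                close(fd);
                return 0;
        }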

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

10 Nov, 2009

1 commit

  • While SSDs track block usage on a per-sector basis, RAID arrays often
    have allocation blocks that are bigger. Allow the discard granularity
    and alignment to be set and teach the topology stacking logic how to
    handle them.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

02 Oct, 2009

1 commit

  • Add the missing blk_trace_remove_sysfs() to pair with
    blk_trace_init_sysfs() introduced in commit
    1d54ad6da9192fed5dd3b60224d9f2dfea0dcd82.
    Also release the kobject in case request_fn is NULL.

    The problem was noticed via a kmemleak backtrace when some sysfs
    entries were not properly destroyed during device removal:

    unreferenced object 0xffff88001aa76640 (size 80):
    comm "lvcreate", pid 2120, jiffies 4294885144
    hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 f0 65 a7 1a 00 88 ff ff .........e......
    90 66 a7 1a 00 88 ff ff 86 1d 53 81 ff ff ff ff .f........S.....
    backtrace:
    [] kmemleak_alloc+0x26/0x60
    [] kmem_cache_alloc+0x133/0x1c0
    [] sysfs_new_dirent+0x41/0x120
    [] sysfs_add_file_mode+0x3c/0xb0
    [] internal_create_group+0xc1/0x1a0
    [] sysfs_create_group+0x13/0x20
    [] blk_trace_init_sysfs+0x14/0x20
    [] blk_register_queue+0x3c/0xf0
    [] add_disk+0x94/0x160
    [] dm_create+0x598/0x6e0 [dm_mod]
    [] dev_create+0x51/0x350 [dm_mod]
    [] ctl_ioctl+0x1a3/0x240 [dm_mod]
    [] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
    [] compat_sys_ioctl+0xcd/0x4f0
    [] sysenter_dispatch+0x7/0x2c
    [] 0xffffffffffffffff

    Signed-off-by: Zdenek Kabelac
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Zdenek Kabelac
     

14 Sep, 2009

1 commit


02 Sep, 2009

1 commit

  • The patch "block: Use accessor functions for queue limits"
    (ae03bf639a5027d27270123f5f6e3ee6a412781d) changed queue_max_sectors_store()
    to use blk_queue_max_sectors() instead of directly assigning the value.

    But blk_queue_max_sectors() differs a bit:
    1. It sets both max_sectors_kb and max_hw_sectors_kb.
    2. It never allows one to raise max_sectors_kb above
    BLK_DEF_MAX_SECTORS. If one specifies a greater value, then
    max_hw_sectors is set to that value but max_sectors is set to
    BLK_DEF_MAX_SECTORS.

    I am not sure whether blk_queue_max_sectors() should be changed, as
    it seems to have been that way for a long time, and there may be
    callers dependent on that behaviour.

    This patch simply reverts to the older way of directly assigning the value to
    max_sectors as it was before.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
     

17 Jul, 2009

1 commit

  • In blk-sysfs.c, queue_var_store() uses an unsigned long to store
    data, but queue_var_show() uses an unsigned int to show data. This
    causes:

    # echo 70000000000 > /sys/block/<dev>/queue/read_ahead_kb
    # cat /sys/block/<dev>/queue/read_ahead_kb   => get wrong value

    Fix it by using unsigned long.

    While at it, convert queue_rq_affinity_show() such that it uses a
    bool variable instead of explicit != 0 testing.
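
    The mismatch is plain integer truncation; a tiny userspace analogue
    of the example above (assumes a 64-bit unsigned long, as on x86_64):

        #include <stdio.h>

        int main(void)
        {
                unsigned long stored = 70000000000UL;      /* what the store side kept */
                unsigned int shown = (unsigned int)stored; /* what the show side printed */

                /* 70000000000 mod 2^32 = 1280523264 - the "wrong value" */
                printf("stored=%lu shown=%u\n", stored, shown);
                return 0;
        }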

    Signed-off-by: Xiaotian Feng
    Signed-off-by: Tejun Heo

    Xiaotian Feng
     

12 Jun, 2009

1 commit

  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     

23 May, 2009

4 commits

  • To support devices with physical block sizes bigger than 512 bytes we
    need to ensure proper alignment. This patch adds support for exposing
    I/O topology characteristics as devices are stacked.

    logical_block_size is the smallest unit the device can address.

    physical_block_size indicates the smallest I/O the device can write
    without incurring a read-modify-write penalty.

    The io_min parameter is the smallest preferred I/O size reported by
    the device. In many cases this is the same as the physical block
    size. However, the io_min parameter can be scaled up when stacking
    (RAID5 chunk size > physical block size).

    The io_opt characteristic indicates the optimal I/O size reported by
    the device. This is usually the stripe width for arrays.

    The alignment_offset parameter indicates the number of bytes the start
    of the device/partition is offset from the device's natural alignment.
    Partition tools and MD/DM utilities can use this to pad their offsets
    so filesystems start on proper boundaries.
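
    Taken together, the new characteristics amount to a handful of
    queue_limits fields, roughly as below (types and ordering here are
    illustrative, not the exact kernel definition):

        struct queue_limits_topology_sketch {
                unsigned short  logical_block_size;  /* smallest addressable unit */
                unsigned int    physical_block_size; /* smallest write without RMW */
                unsigned int    io_min;              /* smallest preferred I/O size */
                unsigned int    io_opt;              /* optimal I/O size (stripe width) */
                unsigned int    alignment_offset;    /* offset from natural alignment */
        };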

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Currently stacking devices do not have a queue directory in sysfs.
    However, many of the I/O characteristics like sector size, maximum
    request size, etc. are queue properties.

    This patch enables the queue directory for MD/DM devices. The elevator
    code has been modified to deal with queues that do not have an I/O
    scheduler.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Convert all external users of queue limits to using wrapper functions
    instead of poking the request queue variables directly.
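
    For illustration, the conversion swaps direct field access for the
    wrappers, e.g. (queue_max_sectors() and queue_logical_block_size()
    being two of the accessors):

        /* before: max_sectors = q->max_sectors; */
        max_sectors = queue_max_sectors(q);

        /* before: block_size = q->hardsect_size; */
        block_size = queue_logical_block_size(q);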

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block size used when addressing the
    device. With SATA 4KB drives coming out, that will no longer be the
    case. The sector size will be 4KB but the logical block size will
    remain 512 bytes. Hence we need to distinguish between the physical
    block size and the logical ditto.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

07 May, 2009

1 commit