Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

25 Jun, 2014

1 commit

66cb45aa4 block: add support for limiting gaps in SG lists ... Browse Code »
13

Another restriction inherited for NVMe - those devices don't support
SG lists that have "gaps" in them. Gaps refers to cases where the
previous SG entry doesn't end on a page boundary. For NVMe, all SG
entries must start at offset 0 (except the first) and end on a page
boundary (except the last).

Signed-off-by: Jens Axboe

Jens Axboe
2014-06-25 06:22:24 +0800

29 May, 2014

1 commit

05f1dd531 block: add queue flag for disabling SG merging ... Browse Code »
26

If devices are not SG starved, we waste a lot of time potentially
collapsing SG segments. Enough that 1.5% of the CPU time goes
to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
which just returns the number of vectors in a bio instead of looping
over all segments and checking for collapsible ones.

Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
merging, if they so desire.

Signed-off-by: Jens Axboe

Jens Axboe
2014-05-29 23:53:32 +0800

08 Feb, 2014

1 commit

5cb8850c9 block: Explicitly handle discard/write same segments ... Browse Code »

Immutable biovecs changed the way biovecs are interpreted - drivers no
longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
using part of an existing segment without modifying it).

This breaks with discards and write_same bios, since for those bi_size
has nothing to do with segments in the biovec. So for now, we need a
fairly gross hack - we fortunately know that there will never be more
than one segment for the entire request, so we can special case
discard/write_same.

Signed-off-by: Kent Overstreet
Tested-by: Hugh Dickins
Signed-off-by: Jens Axboe

Kent Overstreet
2014-02-08 04:54:08 +0800

04 Dec, 2013

1 commit

2b8221e18 block: Really silence spurious compiler warnings ... Browse Code »

The uninitialized_var() macro appears to not work on structs...
Get rid of it, and manually initialize instead.

Signed-off-by: Kent Overstreet
Signed-off-by: Jens Axboe

Kent Overstreet
2013-12-04 05:29:09 +0800

27 Nov, 2013

1 commit

3f273d301 block: Silence spurious compiler warnings ... Browse Code »

Signed-off-by: Kent Overstreet
Signed-off-by: Jens Axboe

Kent Overstreet
2013-11-27 08:50:37 +0800

24 Nov, 2013

3 commits

f619d2546 block: Kill bio_iovec_idx(), __bio_iovec() ... Browse Code »

bio_iovec_idx() and __bio_iovec() don't have any valid uses anymore -
previous users have been converted to bio_iovec_iter() or other methods.

__BVEC_END() has to go too - the bvec array can't be used directly for
the last biovec because we might only be using the first portion of it,
we have to iterate over the bvec array with bio_for_each_segment() which
checks against the current value of bi_iter.bi_size.

Signed-off-by: Kent Overstreet
Cc: Jens Axboe

Kent Overstreet
2013-11-24 14:33:53 +0800
7988613b0 block: Convert bio_for_each_segment() to bvec_iter ... Browse Code »

More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet
Cc: Jens Axboe
Cc: Geert Uytterhoeven
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: "Ed L. Cashin"
Cc: Nick Piggin
Cc: Lars Ellenberg
Cc: Jiri Kosina
Cc: Paul Clements
Cc: Jim Paris
Cc: Geoff Levand
Cc: Yehuda Sadeh
Cc: Sage Weil
Cc: Alex Elder
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris
Cc: Philip Kelleher
Cc: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge
Cc: Neil Brown
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama
Cc: Sreekanth Reddy
Cc: support@lsi.com
Cc: "James E.J. Bottomley"
Cc: Greg Kroah-Hartman
Cc: Alexander Viro
Cc: Steven Whitehouse
Cc: Herton Ronaldo Krzesinski
Cc: Tejun Heo
Cc: Andrew Morton
Cc: Guo Chao
Cc: Asai Thambi S P
Cc: Selvan Mani
Cc: Sam Bradshaw
Cc: Matthew Wilcox
Cc: Keith Busch
Cc: Stephen Hemminger
Cc: Quoc-Son Anh
Cc: Sebastian Ott
Cc: Nitin Gupta
Cc: Minchan Kim
Cc: Jerome Marchand
Cc: Seth Jennings
Cc: "Martin K. Petersen"
Cc: Mike Snitzer
Cc: Vivek Goyal
Cc: "Darrick J. Wong"
Cc: Chris Metcalf
Cc: Jan Kara
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand

Kent Overstreet
2013-11-24 14:33:49 +0800
4f024f379 block: Abstract out bvec iterator ... Browse Code »
13

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet
Cc: Jens Axboe
Cc: Geert Uytterhoeven
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: "Ed L. Cashin"
Cc: Nick Piggin
Cc: Lars Ellenberg
Cc: Jiri Kosina
Cc: Matthew Wilcox
Cc: Geoff Levand
Cc: Yehuda Sadeh
Cc: Sage Weil
Cc: Alex Elder
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris
Cc: Philip Kelleher
Cc: Rusty Russell
Cc: "Michael S. Tsirkin"
Cc: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge
Cc: Neil Brown
Cc: Alasdair Kergon
Cc: Mike Snitzer
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh
Cc: Benny Halevy
Cc: "James E.J. Bottomley"
Cc: Greg Kroah-Hartman
Cc: "Nicholas A. Bellinger"
Cc: Alexander Viro
Cc: Chris Mason
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: Jaegeuk Kim
Cc: Steven Whitehouse
Cc: Dave Kleikamp
Cc: Joern Engel
Cc: Prasad Joshi
Cc: Trond Myklebust
Cc: KONISHI Ryusuke
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Ben Myers
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Cc: Ingo Molnar
Cc: Len Brown
Cc: Pavel Machek
Cc: "Rafael J. Wysocki"
Cc: Herton Ronaldo Krzesinski
Cc: Ben Hutchings
Cc: Andrew Morton
Cc: Guo Chao
Cc: Tejun Heo
Cc: Asai Thambi S P
Cc: Selvan Mani
Cc: Sam Bradshaw
Cc: Wei Yongjun
Cc: "Roger Pau Monné"
Cc: Jan Beulich
Cc: Stefano Stabellini
Cc: Ian Campbell
Cc: Sebastian Ott
Cc: Christian Borntraeger
Cc: Minchan Kim
Cc: Jiang Liu
Cc: Nitin Gupta
Cc: Jerome Marchand
Cc: Joe Perches
Cc: Peng Tao
Cc: Andy Adamson
Cc: fanchaoting
Cc: Jie Liu
Cc: Sunil Mushran
Cc: "Martin K. Petersen"
Cc: Namjae Jeon
Cc: Pankaj Kumar
Cc: Dan Magenheimer
Cc: Mel Gorman 6

Kent Overstreet
2013-11-24 14:33:47 +0800

30 Oct, 2013

1 commit

e7e245000 blk-mq: don't disallow request merges for req->special being set ... Browse Code »

For blk-mq, if a driver has requested per-request payload data
to carry command structures, they are stuffed into req->special.
For an old style request based driver, req->special is used
for the same purpose but indicates that a per-driver request
structure has been prepared for the request already. So for the
old style driver, we do not merge such requests.

As most/all blk-mq drivers will use the payload feature, and
since we have no problem merging on these, make this check
dependent on whether it's a blk-mq enabled driver or not.

Reported-by: Shaohua Li
Signed-off-by: Jens Axboe

Jens Axboe
2013-10-30 02:11:47 +0800

20 Mar, 2013

1 commit

c8164d893 scatterlist: introduce sg_unmark_end ... Browse Code »

This is useful in places that recycle the same scatterlist multiple
times, and do not want to incur the cost of sg_init_table every
time in hot paths.

Acked-by: Jens Axboe
Signed-off-by: Paolo Bonzini
Signed-off-by: Rusty Russell

Paolo Bonzini
2013-03-20 13:13:04 +0800

20 Sep, 2012

3 commits

4363ac7c1 block: Implement support for WRITE SAME ... Browse Code »

The WRITE SAME command supported on some SCSI devices allows the same
block to be efficiently replicated throughout a block range. Only a
single logical block is transferred from the host and the storage device
writes the same data to all blocks described by the I/O.

This patch implements support for WRITE SAME in the block layer. The
blkdev_issue_write_same() function can be used by filesystems and block
drivers to replicate a buffer across a block range. This can be used to
efficiently initialize software RAID devices, etc.

Signed-off-by: Martin K. Petersen
Acked-by: Mike Snitzer
Signed-off-by: Jens Axboe

Martin K. Petersen
2012-09-20 20:31:45 +0800
f31dc1cd4 block: Consolidate command flag and queue limit checks for merges ... Browse Code »

- blk_check_merge_flags() verifies that cmd_flags / bi_rw are
compatible. This function is called for both req-req and req-bio
merging.

- blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used
to query the maximum sector count for a given request or queue. The
calls will return the right value from the queue limits given the
type of command (RW, discard, write same, etc.)

Signed-off-by: Martin K. Petersen
Acked-by: Mike Snitzer
Signed-off-by: Jens Axboe

Martin K. Petersen
2012-09-20 20:31:41 +0800
e2a60da74 block: Clean up special command handling logic ... Browse Code »
5

Remove special-casing of non-rw fs style requests (discard). The nomerge
flags are consolidated in blk_types.h, and rq_mergeable() and
bio_mergeable() have been modified to use them.

bio_is_rw() is used in place of bio_has_data() a few places. This is
done to to distinguish true reads and writes from other fs type requests
that carry a payload (e.g. write same).

Signed-off-by: Martin K. Petersen
Acked-by: Mike Snitzer
Signed-off-by: Jens Axboe

Martin K. Petersen
2012-09-20 20:31:38 +0800

03 Aug, 2012

2 commits

85b9f66a4 block: Add blk_bio_map_sg() helper ... Browse Code »

Add a helper to map a bio to a scatterlist, modelled after
blk_rq_map_sg.

This helper is useful for any driver that wants to create
a scatterlist from its ->make_request_fn method.

Changes in v2:
- Use __blk_segment_map_sg to avoid duplicated code
- Add cocbook style function comment

Cc: Rusty Russell
Cc: Christoph Hellwig
Cc: Tejun Heo
Cc: Shaohua Li
Cc: "Michael S. Tsirkin"
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Christoph Hellwig
Signed-off-by: Minchan Kim
Signed-off-by: Asias He
Signed-off-by: Jens Axboe

Asias He
2012-08-03 05:42:04 +0800
963ab9e5d block: Introduce __blk_segment_map_sg() helper ... Browse Code »

Split the mapping code in blk_rq_map_sg() to a helper
__blk_segment_map_sg(), so that other mapping function, e.g.
blk_bio_map_sg(), can share the code.

Cc: Rusty Russell
Cc: Christoph Hellwig
Cc: Tejun Heo
Cc: Shaohua Li
Cc: "Michael S. Tsirkin"
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Suggested-by: Jens Axboe
Suggested-by: Tejun Heo
Signed-off-by: Asias He
Signed-off-by: Jens Axboe

Asias He
2012-08-03 05:42:03 +0800

08 Feb, 2012

1 commit

050c8ea80 block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions ... Browse Code »

blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test. blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.

elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge(). elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge(). While at it, make rq_merge_ok() functions return
bool.

This is to prepare for plug merge update and doesn't introduce any
behavior change.

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Signed-off-by: Tejun Heo
LKML-Reference:
Original-patch-by: Jens Axboe
Signed-off-by: Jens Axboe

Tejun Heo
2012-02-08 16:19:38 +0800

21 Mar, 2011

1 commit

5e84ea3a9 block: attempt to merge with existing requests on plug flush ... Browse Code »

One of the disadvantages of on-stack plugging is that we potentially
lose out on merging since all pending IO isn't always visible to
everybody. When we flush the on-stack plugs, right now we don't do
any checks to see if potential merge candidates could be utilized.

Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
It works just ELEVATOR_INSERT_SORT, but first checks whether we can
merge with an existing request before doing the insertion (if we fail
merging).

This fixes a regression with multiple processes issuing IO that
can be merged.

Thanks to Shaohua Li for testing and fixing
an accounting bug.

Signed-off-by: Jens Axboe

Jens Axboe
2011-03-21 17:14:27 +0800

14 Jan, 2011

1 commit

275220f0f Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block ... Browse Code »

* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...

Linus Torvalds
2011-01-14 02:45:01 +0800

07 Jan, 2011

1 commit

6c23a9681 block: add internal hd part table references ... Browse Code »

We can't use krefs since it's apparently restricted to very basic
reference counting.

This reverts commit e4a683c8.

Signed-off-by: Jens Axboe

Jens Axboe
2011-01-07 15:43:37 +0800

05 Jan, 2011

1 commit

09e099d4b block: fix accounting bug on cross partition merges ... Browse Code »

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.

| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.

| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.

| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.

Signed-off-by: Jerome Marchand
Signed-off-by: Yasuaki Ishimatsu
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Jerome Marchand
2011-01-05 23:57:38 +0800

17 Dec, 2010

1 commit

e692cb668 block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead ... Browse Code »

When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.

There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.

The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.

Reported-by: Ed Lin
Signed-off-by: Martin K. Petersen
Acked-by: Mike Snitzer
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Martin K. Petersen
2010-12-17 15:35:53 +0800

25 Oct, 2010

1 commit

f253b86b4 Revert "block: fix accounting bug on cross partition merges" ... Browse Code »

This reverts commit 7681bfeeccff5efa9eb29bf09249a3c400b15327.

Conflicts:

include/linux/genhd.h

It has numerous issues with the cleanup path and non-elevator
devices. Revert it for now so we can come up with a clean
version without rushing things.

Signed-off-by: Jens Axboe

Jens Axboe
2010-10-25 04:06:02 +0800

23 Oct, 2010

1 commit

e9dd2b683 Merge branch 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block ... Browse Code »

* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
cfq-iosched: Fix a gcc 4.5 warning and put some comments
block: Turn bvec_k{un,}map_irq() into static inline functions
block: fix accounting bug on cross partition merges
block: Make the integrity mapped property a bio flag
block: Fix double free in blk_integrity_unregister
block: Ensure physical block size is unsigned int
blkio-throttle: Fix possible multiplication overflow in iops calculations
blkio-throttle: limit max iops value to UINT_MAX
blkio-throttle: There is no need to convert jiffies to milli seconds
blkio-throttle: Fix link failure failure on i386
blkio: Recalculate the throttled bio dispatch time upon throttle limit change
blkio: Add root group to td->tg_list
blkio: deletion of a cgroup was causes oops
blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: revert bad fix for memory hotplug causing bounces
Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: Prevent hang_check firing during long I/O
cfq: improve fsync performance for small files
...

Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h

Linus Torvalds
2010-10-23 08:00:32 +0800

19 Oct, 2010

1 commit

7681bfeec block: fix accounting bug on cross partition merges ... Browse Code »

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.

| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.

| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.

| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

When reloading partition tables, quiesce IO to ensure that no
request references to the partition struct exists. When it is safe
to free the partition table, the IO for that device is restarted
again.

Signed-off-by: Yasuaki Ishimatsu
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Yasuaki Ishimatsu
2010-10-19 15:07:02 +0800

25 Sep, 2010

1 commit

f281fb5fe block: prevent merges of discard and write requests ... Browse Code »

Add logic to prevent two I/O requests being merged if
only one of them is a discard. Ditto secure discard.

Without this fix, it is possible for write requests
to transform into discard requests. For example:

Submit bio 1 to discard 8 sectors from sector n
Submit bio 2 to write 8 sectors from sector n + 16
Submit bio 3 to write 8 sectors from sector n + 8

Bio 1 becomes request 1. Bio 2 becomes request 2.
Bio 3 is merged with request 2, and then subsequently
request 2 is merged with request 1 resulting in just
one I/O request which discards all 24 sectors.

Signed-off-by: Adrian Hunter

(Moved the checks above the position checks /Jens)

Signed-off-by: Jens Axboe

Adrian Hunter
2010-09-25 18:42:55 +0800

11 Sep, 2010

1 commit

13f05c8d8 block/scsi: Provide a limit on the number of integrity segments ... Browse Code »

Some controllers have a hardware limit on the number of protection
information scatter-gather list segments they can handle.

Introduce a max_integrity_segments limit in the block layer and provide
a new scsi_host_template setting that allows HBA drivers to provide a
value suitable for the hardware.

Add support for honoring the integrity segment limit when merging both
bios and requests.

Signed-off-by: Martin K. Petersen
Signed-off-by: Jens Axboe

Martin K. Petersen
2010-09-11 02:50:10 +0800

08 Aug, 2010

3 commits

2c8919dee gcc-4.6: block: fix unused but set variables in blk-merge ... Browse Code »

Just some dead code.

Signed-off-by: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Jens Axboe

Andi Kleen
2010-08-08 00:23:10 +0800
7b6d91dae block: unify flags for struct bio and struct request ... Browse Code »

Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2010-08-08 00:20:39 +0800
33659ebba block: remove wrappers for request type/flags ... Browse Code »

Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct requests. This allows much easier grepping for different request
types instead of unwinding through macros.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2010-08-08 00:17:56 +0800

26 Feb, 2010

1 commit

8a78362c4 block: Consolidate phys_segment and hw_segment limits ... Browse Code »

Except for SCSI no device drivers distinguish between physical and
hardware segment limits. Consolidate the two into a single segment
limit.

Signed-off-by: Martin K. Petersen
Signed-off-by: Jens Axboe

Martin K. Petersen
2010-02-26 20:58:08 +0800

07 Oct, 2009

1 commit

316d315bf block: Seperate read and write statistics of in_flight requests v2 ... Browse Code »

Commit a9327cac440be4d8333bba975cbbf76045096275 added seperate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress seperately through sysfs.

But Corrado Zoccolo reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.

So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b

The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;

- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));

With this chunk included, the reported regression gets fixed.

Signed-off-by: Nikanth Karthikesan

--
Signed-off-by: Jens Axboe

Nikanth Karthikesan
2009-10-07 02:16:55 +0800

05 Oct, 2009

1 commit

0f78ab989 Revert "Seperate read and write statistics of in_flight requests" ... Browse Code »

This reverts commit a9327cac440be4d8333bba975cbbf76045096275.

Corrado Zoccolo reports:

"with 2.6.32-rc1 I started getting the following strange output from
"iostat -kx 2":
Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
10,70 0,00 3,16 15,75 0,00 70,38

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 18,22 0,00 0,67 0,01 14,77 0,02
43,94 0,01 10,53 39043915,03 2629219,87
sdb 60,89 9,68 50,79 3,04 1724,43 50,52
65,95 0,70 13,06 488437,47 2629219,87

avg-cpu: %user %nice %system %iowait %steal %idle
2,72 0,00 0,74 0,00 0,00 96,53

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00

avg-cpu: %user %nice %system %iowait %steal %idle
6,68 0,00 0,99 0,00 0,00 92,33

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00

avg-cpu: %user %nice %system %iowait %steal %idle
4,40 0,00 0,73 1,47 0,00 93,40

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 4,00 0,00 3,00 0,00 28,00
18,67 0,06 19,50 333,33 100,00

Global values for service time and utilization are garbage. For
interval values, utilization is always 100%, and service time is
higher than normal.

I bisected it down to:
[a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write
statistics of in_flight requests
and verified that reverting just that commit indeed solves the issue
on 2.6.32-rc1."

So until this is debugged, revert the bad commit.

Signed-off-by: Jens Axboe

Jens Axboe
2009-10-05 03:04:38 +0800

14 Sep, 2009

1 commit

a9327cac4 Seperate read and write statistics of in_flight requests ... Browse Code »

Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.

This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.

Signed-off-by: Nikanth Karthikesan
Signed-off-by: Jens Axboe

Nikanth Karthikesan
2009-09-14 14:24:52 +0800

11 Sep, 2009

2 commits

da6c5c720 scsi,block: update SCSI to handle mixed merge failures ... Browse Code »

Update scsi_io_completion() such that it only fails requests till the
next error boundary and retry the leftover. This enables block layer
to merge requests with different failfast settings and still behave
correctly on errors. Allow merge of requests of different failfast
settings.

As SCSI is currently the only subsystem which follows failfast status,
there's no need to worry about other block drivers for now.

Signed-off-by: Tejun Heo
Cc: Niel Lambrechts
Cc: James Bottomley
Signed-off-by: Jens Axboe

Tejun Heo
2009-09-11 20:33:30 +0800
80a761fd3 block: implement mixed merge of different failfast requests ... Browse Code »

Failfast has characteristics from other attributes. When issuing,
executing and successuflly completing requests, failfast doesn't make
any difference. It only affects how a request is handled on failure.
Allowing requests with different failfast settings to be merged cause
normal IOs to fail prematurely while not allowing has performance
penalties as failfast is used for read aheads which are likely to be
located near in-flight or to-be-issued normal IOs.

This patch introduces the concept of 'mixed merge'. A request is a
mixed merge if it is merge of segments which require different
handling on failure. Currently the only mixable attributes are
failfast ones (or lack thereof).

When a bio with different failfast settings is added to an existing
request or requests of different failfast settings are merged, the
merged request is marked mixed. Each bio carries failfast settings
and the request always tracks failfast state of the first bio. When
the request fails, blk_rq_err_bytes() can be used to determine how
many bytes can be safely failed without crossing into an area which
requires further retrials.

This allows request merging regardless of failfast settings while
keeping the failure handling correct.

This patch only implements mixed merge but doesn't enable it. The
next one will update SCSI to make use of mixed merge.

Signed-off-by: Tejun Heo
Cc: Niel Lambrechts
Signed-off-by: Jens Axboe

Tejun Heo
2009-09-11 20:33:30 +0800

04 Jul, 2009

1 commit

ab0fd1deb block: don't merge requests of different failfast settings ... Browse Code »

Block layer used to merge requests and bios with different failfast
settings. This caused regular IOs to fail prematurely when they were
merged into failfast requests for readahead.

Niel Lambrechts could trigger the problem semi-reliably on ext4 when
resuming from STR. ext4 uses readahead when reading inodes and
combined with the deterministic extra SATA PHY exception cycle during
resume on the specific configuration, non-readahead inode read would
fail causing ext4 errors. Please read the following thread for
details.

http://lkml.org/lkml/2009/5/23/21

This patch makes block layer reject merging if the failfast settings
don't match. This is correct but likely to lower IO performance by
preventing regular IOs from mingling into surrounding readahead
requests. Changes to allow such mixed merges and handle errors
correctly will be added later.

Signed-off-by: Tejun Heo
Reported-by: Niel Lambrechts
Cc: Theodore Tso
Signed-off-by: Jens Axboe

Tejun Heo
2009-07-04 03:06:45 +0800

23 May, 2009

1 commit

ae03bf639 block: Use accessor functions for queue limits ... Browse Code »

Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen
Signed-off-by: Jens Axboe

Martin K. Petersen
2009-05-23 05:22:54 +0800

11 May, 2009

3 commits

a2dec7b36 block: hide request sector and data_len ... Browse Code »

Block low level drivers for some reason have been pretty good at
abusing block layer API. Especially struct request's fields tend to
get violated in all possible ways. Make it clear that low level
drivers MUST NOT access or manipulate rq->sector and rq->data_len
directly by prefixing them with double underscores.

This change is also necessary to break build of out-of-tree codes
which assume the previous block API where internal fields can be
manipulated and rq->data_len carries residual count on completion.

[ Impact: hide internal fields, block API change ]

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2009-05-11 15:50:55 +0800
2e46e8b27 block: drop request->hard_* and *nr_sectors ... Browse Code »

struct request has had a few different ways to represent some
properties of a request. ->hard_* represent block layer's view of the
request progress (completion cursor) and the ones without the prefix
are supposed to represent the issue cursor and allowed to be updated
as necessary by the low level drivers. The thing is that as block
layer supports partial completion, the two cursors really aren't
necessary and only cause confusion. In addition, manual management of
request detail from low level drivers is cumbersome and error-prone at
the very least.

Another interesting duplicate fields are rq->[hard_]nr_sectors and
rq->{hard_cur|current}_nr_sectors against rq->data_len and
rq->bio->bi_size. This is more convoluted than the hard_ case.

rq->[hard_]nr_sectors are initialized for requests with bio but
blk_rq_bytes() uses it only for !pc requests. rq->data_len is
initialized for all request but blk_rq_bytes() uses it only for pc
requests. This causes good amount of confusion throughout block layer
and its drivers and determining the request length has been a bit of
black magic which may or may not work depending on circumstances and
what the specific LLD is actually doing.

rq->{hard_cur|current}_nr_sectors represent the number of sectors in
the contiguous data area at the front. This is mainly used by drivers
which transfers data by walking request segment-by-segment. This
value always equals rq->bio->bi_size >> 9. However, data length for
pc requests may not be multiple of 512 bytes and using this field
becomes a bit confusing.

In general, having multiple fields to represent the same property
leads only to confusion and subtle bugs. With recent block low level
driver cleanups, no driver is accessing or manipulating these
duplicate fields directly. Drop all the duplicates. Now rq->sector
means the current sector, rq->data_len the current total length and
rq->bio->bi_size the current segment length. Everything else is
defined in terms of these three and available only through accessors.

* blk_recalc_rq_sectors() is collapsed into blk_update_request() and
now handles pc and fs requests equally other than rq->sector update.
This means that now pc requests can use partial completion too (no
in-kernel user yet tho).

* bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
now uses byte count as the primary data length.

* blk_rq_pos() is now guranteed to be always correct. In-block users
converted.

* blk_rq_bytes() is now guaranteed to be always valid as is
blk_rq_sectors(). In-block users converted.

* blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
More convenient one is used.

* blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
pointer to request.

[ Impact: API cleanup, single way to represent one property of a request ]

Signed-off-by: Tejun Heo
Cc: Boaz Harrosh
Signed-off-by: Jens Axboe

Tejun Heo
2009-05-11 15:50:54 +0800
83096ebf1 block: convert to pos and nr_sectors accessors ... Browse Code »

With recent cleanups, there is no place where low level driver
directly manipulates request fields. This means that the 'hard'
request fields always equal the !hard fields. Convert all
rq->sectors, nr_sectors and current_nr_sectors references to
accessors.

While at it, drop superflous blk_rq_pos() < 0 test in swim.c.

[ Impact: use pos and nr_sectors accessors ]

Signed-off-by: Tejun Heo
Acked-by: Geert Uytterhoeven
Tested-by: Grant Likely
Acked-by: Grant Likely
Tested-by: Adrian McMenamin
Acked-by: Adrian McMenamin
Acked-by: Mike Miller
Cc: James Bottomley
Cc: Bartlomiej Zolnierkiewicz
Cc: Borislav Petkov
Cc: Sergei Shtylyov
Cc: Eric Moore
Cc: Alan Stern
Cc: FUJITA Tomonori
Cc: Pete Zaitcev
Cc: Stephen Rothwell
Cc: Paul Clements
Cc: Tim Waugh
Cc: Jeff Garzik
Cc: Jeremy Fitzhardinge
Cc: Alex Dubov
Cc: David Woodhouse
Cc: Martin Schwidefsky
Cc: Dario Ballabio
Cc: David S. Miller
Cc: Rusty Russell
Cc: unsik Kim
Cc: Laurent Vivier
Signed-off-by: Jens Axboe

Tejun Heo
2009-05-11 15:50:54 +0800