22 Oct, 2015
3 commits
-
A trace like the following proceeds a crash in bio_integrity_process()
when it goes to use an already freed blk_integrity profile.BUG: unable to handle kernel paging request at ffff8800d31b10d8
IP: [] 0xffff8800d31b10d8
PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3
Oops: 0011 [#1] SMP
Dumping ftrace buffer:
---------------------------------
ndctl-2222 2.... 44526245us : disk_release: pmem1s
systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s
-409 4.... 44574005us : bio_integrity_process: pmem1s
---------------------------------
[..]
Call Trace:
[] ? bio_integrity_process+0x159/0x2d0
[] bio_integrity_verify_fn+0x36/0x60
[] process_one_work+0x1cc/0x4e0Given that a request_queue is pinned while i/o is in flight and that a
gendisk is allowed to have a shorter lifetime, move blk_integrity to
request_queue to satisfy requests arriving after the gendisk has been
torn down.Cc: Christoph Hellwig
Cc: Martin K. Petersen
[martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
Tested-by: Ross Zwisler
Signed-off-by: Dan Williams
Signed-off-by: Jens Axboe -
Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:- Moving the blk_integrity definition to genhd.h.
- Inlining blk_integrity in struct gendisk.
- Removing the dynamic allocation code.
- Adding helper functions which allow gendisk to set up and tear down
the integrity sysfs dir when a disk is added/deleted.- Adding a blk_integrity_revalidate() callback for updating the stable
pages bdi setting.- The calls that depend on whether a device has an integrity profile or
not now key off of the bi->profile pointer.- Simplifying the integrity support routines in DM (Mike Snitzer).
Signed-off-by: Martin K. Petersen
Reported-by: Christoph Hellwig
Reviewed-by: Sagi Grimberg
Signed-off-by: Mike Snitzer
Cc: Dan Williams
Signed-off-by: Dan Williams
Signed-off-by: Jens Axboe -
The integrity kobject purely exists to support the integrity
subdirectory in sysfs and doesn't really have anything to do with the
blk_integrity data structure. Move the kobject to struct gendisk where
it belongs.Signed-off-by: Martin K. Petersen
Reported-by: Christoph Hellwig
Reviewed-by: Sagi Grimberg
Signed-off-by: Dan Williams
Signed-off-by: Jens Axboe
17 Jul, 2015
2 commits
-
Percpu refcount is the perfect match for partition's case,
and the conversion is quite straight.With the convertion, one pair of atomic inc/dec can be saved
for accounting block I/O, which is run in hot path of block I/O.Signed-off-by: Ming Lei
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe -
So the helper can be used in both generic partition
case and part0 case.Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
18 Apr, 2014
1 commit
-
Mostly scripted conversion of the smp_mb__* barriers.
Signed-off-by: Peter Zijlstra
Acked-by: Paul E. McKenney
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar
26 Feb, 2013
1 commit
-
Commit "85865c1 ima: add policy support for file system uuid"
introduced a CONFIG_BLOCK dependency. This patch defines a
wrapper called blk_part_pack_uuid(), which returns -EINVAL,
when CONFIG_BLOCK is not defined.security/integrity/ima/ima_policy.c:538:4: error: implicit declaration
of function 'part_pack_uuid' [-Werror=implicit-function-declaration]Changelog v2:
- Reference commit number in patch description
Changelog v1:
- rename ima_part_pack_uuid() to blk_part_pack_uuid()
- resolve scripts/checkpatch.pl warnings
Changelog v0:
- fix UUID scripts/Lindent msgsReported-by: Randy Dunlap
Reported-by: David Rientjes
Signed-off-by: Mimi Zohar
Acked-by: David Rientjes
Acked-by: Randy Dunlap
Cc: Jens Axboe
Signed-off-by: James Morris
23 Nov, 2012
1 commit
-
This will allow other types of UUID to be stored here, aside from true
UUIDs. This also simplifies code that uses this field, since it's usually
constructed from a, used as a, or compared to other, strings.Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a
/PARTNROFF option was found to be present. However, this modifies the
input string, and causes subsequent calls to devt_from_partuuid() not to
see the /PARTNROFF option, which causes different results. In order to
avoid misleading future maintainers, this parameter is marked const.Signed-off-by: Stephen Warren
Cc: Tejun Heo
Cc: Will Drewry
Cc: Kay Sievers
Signed-off-by: Andrew Morton
Signed-off-by: Jens Axboe
02 Aug, 2012
1 commit
-
Pull core block IO bits from Jens Axboe:
"The most complicated part if this is the request allocation rework by
Tejun, which has been queued up for a long time and has been in
for-next ditto as well.There are a few commits from yesterday and today, mostly trivial and
obvious fixes. So I'm pretty confident that it is sound. It's also
smaller than usual."* 'for-3.6/core' of git://git.kernel.dk/linux-block:
block: remove dead func declaration
block: add partition resize function to blkpg ioctl
block: uninitialized ioc->nr_tasks triggers WARN_ON
block: do not artificially constrain max_sectors for stacking drivers
blkcg: implement per-blkg request allocation
block: prepare for multiple request_lists
block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
blkcg: inline bio_blkcg() and friends
block: allocate io_context upfront
block: refactor get_request[_wait]()
block: drop custom queue draining used by scsi_transport_{iscsi|fc}
mempool: add @gfp_mask to mempool_create_node()
blkcg: make root blkcg allocation use %GFP_KERNEL
blkcg: __blkg_lookup_create() doesn't need radix preload
01 Aug, 2012
1 commit
-
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
allows altering the size of an existing partition, even if it is currently
in use.This patch converts hd_struct->nr_sects into sequence counter because
One might extend a partition while IO is happening to it and update of
nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
can lead to issues like reading inconsistent size of a partition. Sequence
counter have been used so that readers don't have to take bdev mutex lock
as we call sector_in_part() very frequently.Now all the access to hd_struct->nr_sects should happen using sequence
counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
There is one exception though, set_capacity()/get_capacity(). I think
theoritically race should exist there too but this patch does not
modify set_capacity()/get_capacity() due to sheer number of call sites
and I am afraid that change might break something. I have left that as a
TODO item. We can handle it later if need be. This patch does not introduce
any new races as such w.r.t set_capacity()/get_capacity().v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
Signed-off-by: Vivek Goyal
Signed-off-by: Phillip Susi
Signed-off-by: Jens Axboe
17 Jul, 2012
1 commit
-
This function is not really specific to the genhd layer and there are various
re-implementations or open-coded variants of it all throughout the kernel. To
avoid further duplications move the function to a more generic place.While moving also convert it from a macro to a inline function.
Potential users of this function can be detected and converted using the
following coccinelle patch://
@@
expression k;
@@
-container_of(k, struct device, kobj)
+kobj_to_dev(kobj)
//Signed-off-by: Lars-Peter Clausen
Signed-off-by: Greg Kroah-Hartman
15 May, 2012
1 commit
-
6d1d8050b4bc8 "block, partition: add partition_meta_info to hd_struct"
added part_unpack_uuid() which assumes that the passed in buffer has
enough space for sprintfing "%pU" - 37 characters including '\0'.Unfortunately, b5af921ec0233 "init: add support for root devices
specified by partition UUID" supplied 33 bytes buffer to the function
leading to the following panic with stackprotector enabled.Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e
[] panic+0xba/0x1c6
[] ? printk_all_partitions+0x259/0x26xb
[] __stack_chk_fail+0x1b/0x20
[] printk_all_paritions+0x259/0x26xb
[] mount_block_root+0x1bc/0x27f
[] mount_root+0x57/0x5b
[] prepare_namespace+0x13d/0x176
[] ? release_tgcred.isra.4+0x330/0x30
[] kernel_init+0x155/0x15a
[] ? schedule_tail+0x27/0xb0
[] kernel_thread_helper+0x5/0x10
[] ? start_kernel+0x3c5/0x3c5
[] ? gs_change+0x13/0x13Increase the buffer size, remove the dangerous part_unpack_uuid() and
use snprintf() directly from printk_all_partitions().Signed-off-by: Tejun Heo
Reported-by: Szymon Gruszczynski
Cc: Will Drewry
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe
02 Mar, 2012
1 commit
-
Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise KOBJ_CHANGE uevent.However it ends up calling driver's revalidate_disk without open
and could cause oops.In the case of SCSI:
process A process B
----------------------------------------------
sys_open
__blkdev_get
sd_open
returns -ENOMEDIUM
scsi_remove_device
rescan_partitions
sd_revalidate_disk
Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052This patch separates the partition invalidation from rescan_partitions()
and use it for -ENOMEDIUM case.Reported-by: Huajun Li
Signed-off-by: Jun'ichi Nomura
Acked-by: Tejun Heo
Cc: stable@kernel.org
Signed-off-by: Jens Axboe
04 Jan, 2012
1 commit
-
both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.Signed-off-by: Al Viro
10 Nov, 2011
1 commit
-
This reverts commit a72c5e5eb738033938ab30d6a634b74d1d060f10.
The commit introduced alias for block devices which is intended to be
used during logging although actual usage hasn't been committed yet.
This approach adds very limited benefit (raw log might be easier to
follow) which can be trivially implemented in userland but has a lot
of problems.It is much worse than netif renames because it doesn't rename the
actual device but just adds conveninence name which isn't used
universally or enforced. Everything internal including device lookup
and sysfs still uses the internal name and nothing prevents two
devices from using conflicting alias - ie. sda can have sdb as its
alias.This has been nacked by people working on device driver core, block
layer and kernel-userland interface and shouldn't have been
upstreamed. Revert it.http://thread.gmane.org/gmane.linux.kernel/1155104
http://thread.gmane.org/gmane.linux.scsi/68632
http://thread.gmane.org/gmane.linux.scsi/69776Signed-off-by: Tejun Heo
Acked-by: Greg Kroah-Hartman
Acked-by: Kay Sievers
Cc: "James E.J. Bottomley"
Cc: Nao Nishijima
Cc: Alan Cox
Cc: Al Viro
Signed-off-by: Jens Axboe
05 Nov, 2011
1 commit
-
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
virtio-blk: use ida to allocate disk index
hpsa: add small delay when using PCI Power Management to reset for kump
cciss: add small delay when using PCI Power Management to reset for kump
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
drivers/block/loop.c: emit uevent on auto release
drivers/block/cpqarray.c: use pci_dev->revision
loop: always allow userspace partitions and optionally support automatic scanning
...Fic up trivial header file includsion conflict in drivers/block/loop.c
29 Aug, 2011
1 commit
-
This patch allows the user to set an "alias" of the disk via sysfs interface.
This patch only adds a new attribute "alias" in gendisk structure.
To show the alias instead of the device name in kernel messages,
we need to revise printk messages and use alias_name() in them.Example:
(current) printk("disk name is %s\n", disk->disk_name);
(new) printk("disk name is %s\n", alias_name(disk));Users can use alphabets, numbers, '-' and '_' in "alias" attribute. A disk can
have an "alias" which length is up to 255 bytes. This attribute is write-once.Suggested-by: James Bottomley
Suggested-by: Jon Masters
Signed-off-by: Nao Nishijima
Signed-off-by: James Bottomley
24 Aug, 2011
1 commit
-
There are cases where suppressing partition scan is useful - e.g. for
lo devices and pseudo SATA devices which advertise to be a disk but
get upset on partition scan (some port multiplier control devices show
such behavior).This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
regardless of the number of possible partitions. disk_partitionable()
is renamed to disk_part_scan_enabled() as suppressing partition scan
doesn't imply the device can't be partitioned using
BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
directly tests disk_max_parts() to maintain backward-compatibility.-v2: Updated to make it clear that only partition scan is suppressed
not partitioning itself as suggested by Kay Sievers.Signed-off-by: Tejun Heo
Cc: Kay Sievers
Signed-off-by: Jens Axboe
01 Jul, 2011
1 commit
-
Currently, only open(2) is defined as the 'clearing' point. It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events. Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage. Among other
things, GET_EVENT can't detect media removal while TUR can.This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers. disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).Note: Given the current usage, it seems @clearing hint is needlessly
complex. disk_clear_events() can simply clear all events and the hint
can be boolean @flush.Signed-off-by: Tejun Heo
Cc: Kay Sievers
Signed-off-by: Jens Axboe
30 May, 2011
1 commit
-
It was not a good idea to start dereferencing disk->queue from
the fs sysfs strategy for displaying discard alignment. We ran
into first a NULL pointer deref, and after fixing that we sometimes
see unvalid disk->queue pointer values.Since discard is the only one of the bunch actually looking into
the queue, just revert the change.This reverts commit 23ceb5b7719e9276d4fa72a3ecf94dd396755276.
Conflicts:
fs/partitions/check.c
07 May, 2011
1 commit
-
Currently, hd_struct.discard_alignment is only used when we
show /sys/block/sdx/sdx/discard_alignment. So remove it and
calculate when it is asked to show.Signed-off-by: Tao Ma
Signed-off-by: Jens Axboe
22 Apr, 2011
1 commit
-
Disk event code automatically blocks events on excl write. This is
primarily to avoid issuing polling commands while burning is in
progress. This behavior doesn't fit other types of devices with
removeable media where polling commands don't have adverse side
effects and door locking usually doesn't exist.This patch introduces new genhd flag which controls the auto-blocking
behavior and uses it to enable auto-blocking only on optical devices.Note for stable: 2.6.38 and later only
Cc: stable@kernel.org
Signed-off-by: Tejun Heo
Reported-by: Kay Sievers
Signed-off-by: Jens Axboe
22 Mar, 2011
1 commit
-
After the stack plugging introduction, these are called lockless.
Ensure that the counters are updated atomically.Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe
13 Jan, 2011
1 commit
07 Jan, 2011
1 commit
-
We can't use krefs since it's apparently restricted to very basic
reference counting.This reverts commit e4a683c8.
Signed-off-by: Jens Axboe
05 Jan, 2011
1 commit
-
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.Signed-off-by: Jerome Marchand
Signed-off-by: Yasuaki Ishimatsu
Cc: stable@kernel.org
Signed-off-by: Jens Axboe
17 Dec, 2010
3 commits
-
Currently, media presence polling for removeable block devices is done
from userland. There are several issues with this.* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried such command
sequences.* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).This patch implements framework for in-kernel disk event handling,
which includes media presence polling.* bdops->check_events() is added, which supercedes ->media_changed().
It should check whether there's any pending event and return if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called parallelly.* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_change() will be dropped.Signed-off-by: Tejun Heo
Cc: Kay Sievers
Cc: Jan Kara
Signed-off-by: Jens Axboe -
There's no reason for register_disk() and del_gendisk() to be in
fs/partitions/check.c. Move both to genhd.c. While at it, collapse
unlink_gendisk(), which was artificially in a separate function due to
genhd.c / check.c split, into del_gendisk().Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe -
There's no user of the facility. Kill it.
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe
25 Oct, 2010
1 commit
-
This reverts commit 7681bfeeccff5efa9eb29bf09249a3c400b15327.
Conflicts:
include/linux/genhd.h
It has numerous issues with the cleanup path and non-elevator
devices. Revert it for now so we can come up with a clean
version without rushing things.Signed-off-by: Jens Axboe
23 Oct, 2010
1 commit
-
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
cfq-iosched: Fix a gcc 4.5 warning and put some comments
block: Turn bvec_k{un,}map_irq() into static inline functions
block: fix accounting bug on cross partition merges
block: Make the integrity mapped property a bio flag
block: Fix double free in blk_integrity_unregister
block: Ensure physical block size is unsigned int
blkio-throttle: Fix possible multiplication overflow in iops calculations
blkio-throttle: limit max iops value to UINT_MAX
blkio-throttle: There is no need to convert jiffies to milli seconds
blkio-throttle: Fix link failure failure on i386
blkio: Recalculate the throttled bio dispatch time upon throttle limit change
blkio: Add root group to td->tg_list
blkio: deletion of a cgroup was causes oops
blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: revert bad fix for memory hotplug causing bounces
Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: Prevent hang_check firing during long I/O
cfq: improve fsync performance for small files
...Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
19 Oct, 2010
1 commit
-
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.When reloading partition tables, quiesce IO to ensure that no
request references to the partition struct exists. When it is safe
to free the partition table, the IO for that device is restarted
again.Signed-off-by: Yasuaki Ishimatsu
Cc: stable@kernel.org
Signed-off-by: Jens Axboe
15 Sep, 2010
1 commit
-
I'm reposting this patch series as v4 since there have been no additional
comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
are against Linus's tree: 2bfc96a127bc1cc94d26bfaa40159966064f9c8c
(2.6.36-rc3).Would this patchset be suitable for inclusion in an mm branch?
This changes adds a partition_meta_info struct which itself contains a
union of structures that provide partition table specific metadata.This change leaves the union empty. The subsequent patch includes an
implementation for CONFIG_EFI_PARTITION-based metadata.Signed-off-by: Will Drewry
Signed-off-by: Jens Axboe
20 Aug, 2010
1 commit
-
This adds annotations for RCU operations in core kernel components
Signed-off-by: Arnd Bergmann
Signed-off-by: Paul E. McKenney
Cc: Al Viro
Cc: Jens Axboe
Cc: Andrew Morton
Reviewed-by: Josh Triplett
16 Mar, 2010
1 commit
-
This flag is not used, so best discarded.
Signed-off-by: NeilBrown
--
Hi Jens,
I came across this recently - these are the only two occurances
of "GENHD_FL_DRIVERFS" in the kernel, so it cannot be needed.
NeilBrown
Signed-off-by: Jens Axboe
17 Feb, 2010
1 commit
-
Add __percpu sparse annotations to core subsystems.
These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.Signed-off-by: Tejun Heo
Reviewed-by: Christoph Lameter
Acked-by: Paul E. McKenney
Cc: Jens Axboe
Cc: linux-mm@kvack.org
Cc: Rusty Russell
Cc: Dipankar Sarma
Cc: Peter Zijlstra
Cc: Andrew Morton
Cc: Eric Biederman
11 Jan, 2010
1 commit
-
This fixes the sparse warning:
fs/ext4/super.c:2390:40: warning: symbol 'i' shadows an earlier one
fs/ext4/super.c:2368:22: originally declared hereUsing 'i' in a macro is dubious practice.
Signed-off-by: Stephen Hemminger
Signed-off-by: Jens Axboe
10 Nov, 2009
1 commit
-
While SSDs track block usage on a per-sector basis, RAID arrays often
have allocation blocks that are bigger. Allow the discard granularity
and alignment to be set and teach the topology stacking logic how to
handle them.Signed-off-by: Martin K. Petersen
Signed-off-by: Jens Axboe
07 Oct, 2009
1 commit
-
Commit a9327cac440be4d8333bba975cbbf76045096275 added seperate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress seperately through sysfs.But Corrado Zoccolo reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b
The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan
--
Signed-off-by: Jens Axboe
05 Oct, 2009
1 commit
-
This reverts commit a9327cac440be4d8333bba975cbbf76045096275.
Corrado Zoccolo reports:
"with 2.6.32-rc1 I started getting the following strange output from
"iostat -kx 2":
Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU)avg-cpu: %user %nice %system %iowait %steal %idle
10,70 0,00 3,16 15,75 0,00 70,38Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 18,22 0,00 0,67 0,01 14,77 0,02
43,94 0,01 10,53 39043915,03 2629219,87
sdb 60,89 9,68 50,79 3,04 1724,43 50,52
65,95 0,70 13,06 488437,47 2629219,87avg-cpu: %user %nice %system %iowait %steal %idle
2,72 0,00 0,74 0,00 0,00 96,53Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00avg-cpu: %user %nice %system %iowait %steal %idle
6,68 0,00 0,99 0,00 0,00 92,33Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00avg-cpu: %user %nice %system %iowait %steal %idle
4,40 0,00 0,73 1,47 0,00 93,40Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 4,00 0,00 3,00 0,00 28,00
18,67 0,06 19,50 333,33 100,00Global values for service time and utilization are garbage. For
interval values, utilization is always 100%, and service time is
higher than normal.I bisected it down to:
[a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write
statistics of in_flight requests
and verified that reverting just that commit indeed solves the issue
on 2.6.32-rc1."So until this is debugged, revert the bad commit.
Signed-off-by: Jens Axboe