Eric Lee / smarc-fsl-linux-kernel

16 Jan, 2012

1 commit

b3c9dd182 Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block

* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
Revert "block: recursive merge requests"
block: Stop using macro stubs for the bio data integrity calls
blockdev: convert some macros to static inlines
fs: remove unneeded plug in mpage_readpages()
block: Add BLKROTATIONAL ioctl
block: Introduce blk_set_stacking_limits function
block: remove WARN_ON_ONCE() in exit_io_context()
block: an exiting task should be allowed to create io_context
block: ioc_cgroup_changed() needs to be exported
block: recursive merge requests
block, cfq: fix empty queue crash caused by request merge
block, cfq: move icq creation and rq->elv.icq association to block core
block, cfq: restructure io_cq creation path for io_context interface cleanup
block, cfq: move io_cq exit/release to blk-ioc.c
block, cfq: move icq cache management to block core
block, cfq: move io_cq lookup to blk-ioc.c
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
block: remove elevator_queue->ops
block: reorder elevator switch sequence
...

Fix up conflicts in:
- block/blk-cgroup.c
Switch from can_attach_task to can_attach
- block/cfq-iosched.c
conflict with now removed cic index changes (we now use q->id instead)

Linus Torvalds
2012-01-16 04:24:45 +0800

07 Jan, 2012

1 commit

ece2ccb66 Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into Z

Al Viro
2012-01-07 12:15:54 +0800

04 Jan, 2012

3 commits

2c9ede55e switch device_get_devnode() and ->devnode() to umode_t *

both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:54:55 +0800
ff01bb483 fs: move code out of buffer.c

Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.

Signed-off-by: Nick Piggin
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Al Viro
2012-01-04 11:54:07 +0800
4752bc309 make register_disk() static

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:54:05 +0800

14 Dec, 2011

1 commit

09ac46c42 block: misc updates to blk_get_queue()

* blk_get_queue() is peculiar in that it returns 0 on success and 1 on
failure instead of 0 / -errno or boolean. Update it such that it
returns %true on success and %false on failure.

* Make sure the caller checks for the return value.

* Separate out __blk_get_queue() which doesn't check whether @q is
dead and put it in blk.h. This will be used later.

This patch doesn't introduce any functional changes.
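
The change of return convention can be illustrated with a small userspace sketch. Everything below (`fake_queue`, `old_get_queue`, `unchecked_get_queue`, `new_get_queue`) is a hypothetical stand-in chosen for illustration and is not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request queue; not the kernel's struct. */
struct fake_queue {
    bool dead;   /* set once the queue is being torn down */
    int refcnt;  /* number of references currently held */
};

/* Old convention described above: 0 on success, 1 on failure. */
static int old_get_queue(struct fake_queue *q)
{
    assert(q != NULL);
    if (q->dead)
        return 1;
    q->refcnt++;
    return 0;
}

/* Unchecked helper: takes a reference without testing for a dead queue. */
static void unchecked_get_queue(struct fake_queue *q)
{
    assert(q != NULL);
    q->refcnt++;
}

/* New convention: true on success, false on failure. */
static bool new_get_queue(struct fake_queue *q)
{
    assert(q != NULL);
    if (q->dead)
        return false;
    unchecked_get_queue(q);
    return true;
}
```

With the boolean form a caller can write `if (!new_get_queue(q)) return;`, which reads naturally, whereas the old form inverted the usual truthiness of "success".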

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:38 +0800

10 Nov, 2011

1 commit

d0985394e block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"

This reverts commit a72c5e5eb738033938ab30d6a634b74d1d060f10.

The commit introduced alias for block devices which is intended to be
used during logging although actual usage hasn't been committed yet.
This approach adds very limited benefit (raw log might be easier to
follow) which can be trivially implemented in userland but has a lot
of problems.

It is much worse than netif renames because it doesn't rename the
actual device but just adds a convenience name which isn't used
universally or enforced. Everything internal including device lookup
and sysfs still uses the internal name and nothing prevents two
devices from using conflicting alias - ie. sda can have sdb as its
alias.

This has been nacked by people working on device driver core, block
layer and kernel-userland interface and shouldn't have been
upstreamed. Revert it.

http://thread.gmane.org/gmane.linux.kernel/1155104
http://thread.gmane.org/gmane.linux.scsi/68632
http://thread.gmane.org/gmane.linux.scsi/69776

Signed-off-by: Tejun Heo
Acked-by: Greg Kroah-Hartman
Acked-by: Kay Sievers
Cc: "James E.J. Bottomley"
Cc: Nao Nishijima
Cc: Alan Cox
Cc: Al Viro
Signed-off-by: Jens Axboe

Tejun Heo
2011-11-10 16:03:55 +0800

05 Nov, 2011

2 commits

3d0a8d10c Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-block

* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
virtio-blk: use ida to allocate disk index
hpsa: add small delay when using PCI Power Management to reset for kump
cciss: add small delay when using PCI Power Management to reset for kump
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
drivers/block/loop.c: emit uevent on auto release
drivers/block/cpqarray.c: use pci_dev->revision
loop: always allow userspace partitions and optionally support automatic scanning
...

Fix up trivial header file inclusion conflict in drivers/block/loop.c

Linus Torvalds
2011-11-05 08:22:14 +0800
b4fdcb02f Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block

* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c

Linus Torvalds
2011-11-05 08:06:58 +0800

19 Oct, 2011

1 commit

523e1d399 block: make gendisk hold a reference to its queue

The following command sequence triggers an oops.

# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt

general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:

Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
RIP: 0010:[] [] __lock_acquire+0x389/0x1d60
...
Call Trace:
[] lock_acquire+0x95/0x140
[] _raw_spin_lock+0x3b/0x50
[] bdi_lock_two+0x5c/0x70
[] bdev_inode_switch_bdi+0x4c/0xf0
[] __blkdev_put+0x11b/0x1d0
[] __blkdev_put+0x160/0x1d0
[] blkdev_put+0x5f/0x190
[] kill_block_super+0x4d/0x80
[] deactivate_locked_super+0x45/0x70
[] deactivate_super+0x4a/0x70
[] mntput_no_expire+0xed/0x130
[] sys_umount+0x7e/0x3a0
[] system_call_fastpath+0x16/0x1b

This is because bdev holds on to disk but disk doesn't pin the
associated queue. If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.

Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.

Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.
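
The reference-counting idea behind this fix can be sketched in userspace as follows. The `toy_*` names are hypothetical and only model the ordering described above; the real kernel objects and helpers differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical models of a queue and a disk; not kernel structures. */
struct toy_queue {
    int refcnt;
    bool freed;  /* set when the last reference is dropped */
};

struct toy_disk {
    struct toy_queue *queue;
};

static void toy_queue_get(struct toy_queue *q)
{
    assert(!q->freed);
    q->refcnt++;
}

static void toy_queue_put(struct toy_queue *q)
{
    assert(q->refcnt > 0);
    if (--q->refcnt == 0)
        q->freed = true;
}

/* Registering the disk takes an extra reference on its queue. */
static void toy_add_disk(struct toy_disk *d, struct toy_queue *q)
{
    d->queue = q;
    toy_queue_get(q);
}

/* Releasing the disk drops the reference taken at registration. */
static void toy_disk_release(struct toy_disk *d)
{
    toy_queue_put(d->queue);
    d->queue = NULL;
}
```

In this model, dropping the base reference on device removal no longer frees the queue while the disk is still around; the queue goes away only once the disk itself is released.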

Signed-off-by: Tejun Heo
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Tejun Heo
2011-10-19 20:31:07 +0800

29 Aug, 2011

1 commit

a72c5e5eb [SCSI] genhd: add a new attribute "alias" in gendisk

This patch allows the user to set an "alias" of the disk via sysfs interface.

This patch only adds a new attribute "alias" in gendisk structure.
To show the alias instead of the device name in kernel messages,
we need to revise printk messages and use alias_name() in them.

Example:
(current) printk("disk name is %s\n", disk->disk_name);
(new) printk("disk name is %s\n", alias_name(disk));

Users can use letters, digits, '-' and '_' in the "alias" attribute. An alias
can be up to 255 bytes long. This attribute is write-once.

Suggested-by: James Bottomley
Suggested-by: Jon Masters
Signed-off-by: Nao Nishijima
Signed-off-by: James Bottomley

Nao Nishijima
2011-08-29 15:16:19 +0800

24 Aug, 2011

1 commit

d27769ec3 block: add GENHD_FL_NO_PART_SCAN

There are cases where suppressing partition scan is useful - e.g. for
lo devices and pseudo SATA devices which advertise to be a disk but
get upset on partition scan (some port multiplier control devices show
such behavior).

This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
regardless of the number of possible partitions. disk_partitionable()
is renamed to disk_part_scan_enabled() as suppressing partition scan
doesn't imply the device can't be partitioned using
BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
directly tests disk_max_parts() to maintain backward-compatibility.

-v2: Updated to make it clear that only partition scan is suppressed
not partitioning itself as suggested by Kay Sievers.
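
The distinction between "scan suppressed" and "cannot be partitioned" might look like the following sketch. The flag value, the `toy_disk` struct and the "more than one possible partition" test are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FL_NO_PART_SCAN 0x1u  /* hypothetical flag value */

/* Hypothetical model of the relevant disk fields. */
struct toy_disk {
    int max_parts;       /* number of possible partitions */
    unsigned int flags;
};

/* Scanning needs room for partitions and must not be suppressed by the flag. */
static bool toy_part_scan_enabled(const struct toy_disk *d)
{
    assert(d != NULL);
    return d->max_parts > 1 && !(d->flags & TOY_FL_NO_PART_SCAN);
}

/* Whether partitions can exist at all depends only on max_parts. */
static bool toy_can_have_partitions(const struct toy_disk *d)
{
    assert(d != NULL);
    return d->max_parts > 1;
}
```

A disk with the flag set reports scanning as disabled while still being partitionable from userland, which is the behavior the -v2 note asks to make explicit.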

Signed-off-by: Tejun Heo
Cc: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-08-24 02:01:04 +0800

02 Aug, 2011

1 commit

f95fe9cfb block/genhd.c: remove useless cast in diskstats_show()

Remove the (unsigned long long) cast in diskstats_show() and adjust the
seq_printf() format string to 'unsigned long'.

diskstats_show() uses part_stat_read() to get the stats, which either
accesses the specified field in the struct disk_stats directly (non SMP)
or sums up the per CPU values in a variable of the same type as the field,
so in any case the result will have the same type and range as the
specified field which for all disk_stats entries is unsigned long

Also, for unsigned long ranges the output of %lu should be identical to
the one of %llu, so no change in the actual proc entry contents.
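
The claim that the two format strings print the same text for any unsigned long value can be checked with a short userspace sketch; `same_output` is a hypothetical helper and uses snprintf() rather than seq_printf().

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if "%llu" with a cast and "%lu" without one render v identically. */
static int same_output(unsigned long v)
{
    char with_cast[32];
    char without_cast[32];

    snprintf(with_cast, sizeof(with_cast), "%llu", (unsigned long long)v);
    snprintf(without_cast, sizeof(without_cast), "%lu", v);
    return strcmp(with_cast, without_cast) == 0;
}
```

Calling `same_output(ULONG_MAX)` exercises the largest value an unsigned long field can hold.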

Signed-off-by: Herbert Poetzl
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Jens Axboe

Herbert Poetzl
2011-08-02 18:43:50 +0800

26 Jul, 2011

1 commit

096a705bb Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block

* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
block: strict rq_affinity
backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
block: fix patch import error in max_discard_sectors check
block: reorder request_queue to remove 64 bit alignment padding
CFQ: add think time check for group
CFQ: add think time check for service tree
CFQ: move think time check variables to a separate struct
fixlet: Remove fs_excl from struct task.
cfq: Remove special treatment for metadata rqs.
block: document blk_plug list access
block: avoid building too big plug list
compat_ioctl: fix make headers_check regression
block: eliminate potential for infinite loop in blkdev_issue_discard
compat_ioctl: fix warning caused by qemu
block: flush MEDIA_CHANGE from drivers on close(2)
blk-throttle: Make total_nr_queued unsigned
block: Add __attribute__((format(printf...) and fix fallout
fs/partitions/check.c: make local symbols static
block:remove some spare spaces in genhd.c
block:fix the comment error in blkdev.h
...

Linus Torvalds
2011-07-26 01:33:36 +0800

21 Jul, 2011

1 commit

57bdfbf9e block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()

The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb).

Signed-off-by: Lai Jiangshan
Signed-off-by: Paul E. McKenney
Cc: Jens Axboe
Reviewed-by: Josh Triplett

Lai Jiangshan
2011-07-21 05:10:13 +0800

01 Jul, 2011

2 commits

85ef06d1d block: flush MEDIA_CHANGE from drivers on close(2)

Currently, only open(2) is defined as the 'clearing' point. It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events. Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.

The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage. Among other
things, GET_EVENT can't detect media removal while TUR can.

This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers. disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.

This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).

Note: Given the current usage, it seems @clearing hint is needlessly
complex. disk_clear_events() can simply clear all events and the hint
can be boolean @flush.

Signed-off-by: Tejun Heo
Cc: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-07-01 22:17:47 +0800
04bf7869c Merge branch 'for-linus' into for-3.1/core

Conflicts:
block/blk-throttle.c
block/cfq-iosched.c

Signed-off-by: Jens Axboe

Jens Axboe
2011-07-01 22:17:13 +0800

13 Jun, 2011

1 commit

9f5e48655 block:remove some spare spaces in genhd.c

Remove the end-of-line spaces in genhd.c.

Signed-off-by: Wanlong Gao
Signed-off-by: Jens Axboe

Wanlong Gao
2011-06-13 16:45:43 +0800

10 Jun, 2011

3 commits

fdd514e16 block: make disk_block_events() properly wait for work cancellation

disk_block_events() should guarantee that the event work is not in
flight on return and once blocked it shouldn't issue further
cancellations.

Because there was no synchronization between the first blocker doing
cancel_delayed_work_sync() and the following blockers, the following
blockers could finish before cancellation was complete, which broke
both guarantees - event work could be in flight and cancellation could
happen after return.

This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
bug#34662.

https://bugzilla.kernel.org/show_bug.cgi?id=34662

Fix it by adding an outer mutex which protects both block count
manipulation and work cancellation.

-v2: Use outer mutex instead of bit waitqueue per Linus.

Signed-off-by: Tejun Heo
Tested-by: Sitsofe Wheeler
Reported-by: Sitsofe Wheeler
Reported-by: Borislav Petkov
Reported-by: Meelis Roos
Reported-by: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-06-10 02:43:59 +0800
c3af54afb block: remove non-syncing __disk_block_events() and fold it into disk_block_events()

After the previous update to disk_check_events(), nobody is using
non-syncing __disk_block_events(). Remove @sync and, as this makes
__disk_block_events() virtually identical to disk_block_events(),
remove the underscore prefixed version.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Signed-off-by: Jens Axboe

Tejun Heo
2011-06-10 02:43:55 +0800
a9dce2a3b block: don't use non-syncing event blocking in disk_check_events()

This patch is part of fix for triggering of WARN_ON_ONCE() in
disk_clear_events() reported in bug#34662.

https://bugzilla.kernel.org/show_bug.cgi?id=34662

disk_clear_events() blocks events, schedules and flushes the event
work. It expects the work to have started execution on schedule and
finished on return from flush. WARN_ON_ONCE() triggers if the event
work hasn't executed as expected. This problem happens because
__disk_block_events() fails to guarantee that the event work item is
not in flight on return from the function in race-free manner. The
problem is two-fold and this patch addresses one of them.

When __disk_block_events() is called with @sync == %false, it bumps
event block count, calls cancel_delayed_work() and returns. This makes
it impossible to guarantee that event polling is not in flight on
return from syncing __disk_block_events() - if the first blocker was
non-syncing, polling could still be in progress and later syncing ones
would assume that the first blocker already canceled it.

Making __disk_block_events() cancel_sync regardless of block count
isn't feasible either as it may race with forced event checking in
disk_clear_events().

As disk_check_events() is the only user of non-syncing
__disk_block_events(), updating it to directly cancel and schedule
event work is the easiest way to solve the issue.

Note that there's another bug in __disk_block_events() and this patch
doesn't fix the issue completely. Later patch will fix the other bug.

Signed-off-by: Tejun Heo
Tested-by: Sitsofe Wheeler
Reported-by: Sitsofe Wheeler
Reported-by: Borislav Petkov
Reported-by: Meelis Roos
Reported-by: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-06-10 02:43:54 +0800

27 May, 2011

1 commit

75e3f3ee3 block: always allocate genhd->ev if check_events is implemented

9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe
drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block
drivers which have inadequate ->check_events(). Combined with earlier
change 7c88a168da (block: don't propagate unlisted DISK_EVENTs to
userland), this enables using ->check_events() for internal processing
while avoiding enabling in-kernel block event polling which can lead
to infinite event loop.

Unfortunately, this made many drivers including floppy without any bit
set in disk->events and ->async_events in which case disk_add_events()
simply skipped allocation of disk->ev, which disables whole event
handling. As ->check_events() is still used during open processing
for revalidation, this can lead to open failure.

This patch always allocates disk->ev if ->check_events is implemented.
In the long term, it would make sense to simply include the event
structure inline into genhd as it's now used by virtually all block
devices.

Signed-off-by: Tejun Heo
Reported-by: Ondrej Zary
Reported-by: Alex Villacis Lasso
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Tejun Heo
2011-05-27 03:06:50 +0800

22 Apr, 2011

1 commit

7c88a168d block: don't propagate unlisted DISK_EVENTs to userland

DISK_EVENT_MEDIA_CHANGE is used for both userland visible event and
internal event for revalidation of removable devices. Some legacy
drivers don't implement proper event detection and continuously
generate events under certain circumstances. For example, ide-cd
generates media changed continuously if there's no media in the drive,
which can lead to infinite loop of events jumping back and forth
between the driver and userland event handler.

This patch updates disk event infrastructure such that it never
propagates events not listed in disk->events to userland. Those
events are processed the same for internal purposes but uevent
generation is suppressed.

This also ensures that userland only gets events which are advertised
in the @events sysfs node lowering risk of confusion.
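
The filtering rule amounts to a bitmask intersection, sketched below. The `TOY_EVENT_*` constants and `userland_events` are hypothetical names used only to illustrate the idea; the bit values are assumptions.

```c
#include <assert.h>

#define TOY_EVENT_MEDIA_CHANGE  (1u << 0)  /* hypothetical bit assignment */
#define TOY_EVENT_EJECT_REQUEST (1u << 1)  /* hypothetical bit assignment */

/*
 * Events forwarded to userland: only those that are both pending and
 * listed in the mask the disk advertises. Unlisted events are still
 * available to the caller for internal processing via `pending`.
 */
static unsigned int userland_events(unsigned int pending, unsigned int advertised)
{
    return pending & advertised;
}
```

A legacy driver that advertises an empty mask would therefore never trigger a uevent, even if it reports a media change on every check.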

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-04-22 01:43:58 +0800

31 Mar, 2011

1 commit

25985edce Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi

Lucas De Marchi
2011-03-31 22:26:23 +0800

10 Mar, 2011

1 commit

facc31ddc block: Don't implicitly trigger event check on disk_unblock_events()

Currently, disk_unblock_events() implicitly kicks an event check if the
block count reaches zero. This behavior is not described in the
comment and hinders future changes. Make the unblocker
explicitly check events by calling disk_check_events() as necessary.

This patch doesn't cause any behavior difference.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Kay Sievers

Tejun Heo
2011-03-10 02:54:27 +0800

05 Mar, 2011

1 commit

e83a46bbb Merge branch 'for-linus' of ../linux-2.6-block into block-for-2.6.39/core

This merge creates two sets of conflicts. One is a set of simple context
conflicts caused by removal of throtl_scheduled_delayed_work() in
for-linus and removal of throtl_shutdown_timer_wq() in
for-2.6.39/core.

The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
call directly into q->request_fn() __blk_run_queue()) in for-linus
clashing with the FLUSH reimplementation in for-2.6.39/core. The conflict
isn't trivial but the resolution is straightforward.

* __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
should be called with @force_kblockd set to %true.

* elv_insert() in blk_kick_flush() should use
%ELEVATOR_INSERT_REQUEUE.

Both changes are to avoid invoking ->request_fn() directly from
request completion path and closely match the changes in the commit
255bb490c8.

Signed-off-by: Tejun Heo

Tejun Heo
2011-03-05 02:09:02 +0800

03 Mar, 2011

1 commit

53f22956e block/genhd: Change some numerals into macros

Replace the numerals in diskstats_show() with macros.

Cc: Jens Axboe
Signed-off-by: Liu Yuan
Signed-off-by: Jens Axboe

Liu Yuan
2011-03-03 00:00:15 +0800

24 Feb, 2011

1 commit

93b270f76 Fix over-zealous flush_disk when changing device size.

There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change) so any
data we hold becomes irrelevant.
In the other, the device has changed size (check_disk_size_change)
so data we hold may be irrelevant.

In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.

In the former case it makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data. In the
second case it *does*not* make sense to discard dirty buffers
as that will lead to file system corruption when you simply enlarge
the containing devices.

flush_disk calls __invalidate_device.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.

invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.

invalidate_bdev *does*not* discard dirty pages, but I don't really care
about that at present.

So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.

flush_disk then passes true from check_disk_change and false from
check_disk_size_change.

dm avoids tripping over this problem by calling i_size_write directly
rather than using check_disk_size_change.

md does use check_disk_size_change and so is affected.

This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.

Cc: stable@kernel.org
Acked-by: Jeff Moyer
Cc: Andrew Patterson
Cc: Jens Axboe
Signed-off-by: NeilBrown

NeilBrown
2011-02-24 14:25:47 +0800

13 Jan, 2011

1 commit

81c5e2ae3 Merge branch 'for-2.6.38/event-handling' into for-2.6.38/core

Jens Axboe
2011-01-13 21:47:54 +0800

07 Jan, 2011

1 commit

6c23a9681 block: add internal hd part table references

We can't use krefs since it's apparently restricted to very basic
reference counting.

This reverts commit e4a683c8.

Signed-off-by: Jens Axboe

Jens Axboe
2011-01-07 15:43:37 +0800

05 Jan, 2011

1 commit

09e099d4b block: fix accounting bug on cross partition merges

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

The cause is incorrect accounting of hd_struct->in_flight when a bio is
merged, by ELEVATOR_FRONT_MERGE, into a request that belongs to a different
partition.

The detailed root cause is as follows.

Assume that there are two partitions, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.

     | hd_struct->in_flight
---------------------------
sda1 |  0
sda2 |  1
---------------------------

2. A bio that belongs to sda1 is issued and is merged into the request
mentioned in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request
changes from the sda2 region to the sda1 region. However, neither partition's
hd_struct->in_flight is changed.

     | hd_struct->in_flight
---------------------------
sda1 |  0
sda2 |  1
---------------------------

3. The request is finished and blk_account_io_done() is called. Because the
request now starts in the sda1 region, sda1's hd_struct->in_flight, not
sda2's, is decremented.

     | hd_struct->in_flight
---------------------------
sda1 | -1
sda2 |  1
---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.
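
The accounting imbalance and the caching fix can be reproduced in a small userspace model. All `toy_*` names are hypothetical, the partitions are assumed to be contiguous, and the merge is modeled simply as a change of the request's first sector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical partition covering sectors [start, end). */
struct toy_part {
    long start;
    long end;
    int in_flight;
};

/* Hypothetical request that remembers the partition it was accounted to. */
struct toy_request {
    long sector;
    struct toy_part *part;
};

static struct toy_part *toy_lookup(struct toy_part *parts, size_t n, long sector)
{
    for (size_t i = 0; i < n; i++)
        if (sector >= parts[i].start && sector < parts[i].end)
            return &parts[i];
    return NULL;
}

/*
 * Runs the three steps from the description on parts[0] (sda1) and
 * parts[1] (sda2). With use_cache == 0 the partition is looked up again
 * at completion (old behavior); with use_cache != 0 the partition cached
 * in the request is used (the approach this patch takes).
 */
static void toy_run(struct toy_part *parts, size_t n, int use_cache)
{
    struct toy_request rq = { .sector = parts[1].start, .part = NULL };
    struct toy_part *done;

    /* Step 1: the request is accounted to the partition of its first sector. */
    rq.part = toy_lookup(parts, n, rq.sector);
    assert(rq.part != NULL);
    rq.part->in_flight++;

    /* Step 2: a merge moves the first sector into the previous partition. */
    rq.sector = parts[0].end - 8;

    /* Step 3: completion decrements the counter of the chosen partition. */
    done = use_cache ? rq.part : toy_lookup(parts, n, rq.sector);
    assert(done != NULL);
    done->in_flight--;
}
```

Without the cache the model ends with sda1 at -1 and sda2 at 1, matching the table in step 3; with the cache both counters return to 0.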

Signed-off-by: Jerome Marchand
Signed-off-by: Yasuaki Ishimatsu
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Jerome Marchand
2011-01-05 23:57:38 +0800

17 Dec, 2010

5 commits

e61eb2e93 fs/block: type signature of major_to_index(int) to major_to_index(unsigned)

The major/minor device numbers are always defined and used as `unsigned'.

Signed-off-by: Yang Zhang
Signed-off-by: Jens Axboe

Yang Zhang
2010-12-17 16:00:18 +0800
b9f985b6e block: convert !IS_ERR(p) && p to !IS_ERR_OR_NULL(p)

Signed-off-by: Yang Zhang
Signed-off-by: Jens Axboe

Yang Zhang
2010-12-17 15:58:36 +0800
77ea887e4 implement in-kernel gendisk events handling

Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.

* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such command
sequences.

* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.

* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).

This patch implements framework for in-kernel disk event handling,
which includes media presence polling.

* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there's any pending event and return it if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called in parallel.

* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.

* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.

* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.

* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.

* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.

* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
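
The polling-interval rules in the list above can be summarized in a short sketch. `effective_poll_msecs` and its parameters are hypothetical names; a return value of 0 is taken to mean "polling disabled", following the description of the system default.

```c
#include <assert.h>

/*
 * dev_msecs:    per-device setting, -1 meaning "use the system setting"
 * sys_msecs:    system-wide default, 0 meaning "disabled"
 * events:       mask of all events the device supports
 * async_events: mask of events the device can report without polling
 */
static long effective_poll_msecs(long dev_msecs, long sys_msecs,
                                 unsigned int events, unsigned int async_events)
{
    assert(dev_msecs >= -1 && sys_msecs >= 0);

    /* An explicitly set per-device interval always wins. */
    if (dev_msecs >= 0)
        return dev_msecs;

    /* If every supported event is reported asynchronously, do not poll. */
    if ((events & ~async_events) == 0)
        return 0;

    /* Otherwise fall back to the system-wide interval. */
    return sys_msecs;
}
```

For example, a device whose only supported event is also in its asynchronous mask is not polled even when the system interval is 2000 ms, unless its own interval is set explicitly.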

Signed-off-by: Tejun Heo
Cc: Kay Sievers
Cc: Jan Kara
Signed-off-by: Jens Axboe

Tejun Heo
2010-12-17 00:53:38 +0800
d2bf1b672 block: move register_disk() and del_gendisk() to block/genhd.c

There's no reason for register_disk() and del_gendisk() to be in
fs/partitions/check.c. Move both to genhd.c. While at it, collapse
unlink_gendisk(), which was artificially in a separate function due to
genhd.c / check.c split, into del_gendisk().

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2010-12-17 00:53:38 +0800
dddd9dc34 block: kill genhd_media_change_notify()

There's no user of the facility. Kill it.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2010-12-17 00:53:38 +0800

25 Oct, 2010

1 commit

f253b86b4 Revert "block: fix accounting bug on cross partition merges"

This reverts commit 7681bfeeccff5efa9eb29bf09249a3c400b15327.

Conflicts:

include/linux/genhd.h

It has numerous issues with the cleanup path and non-elevator
devices. Revert it for now so we can come up with a clean
version without rushing things.

Signed-off-by: Jens Axboe

Jens Axboe
2010-10-25 04:06:02 +0800

23 Oct, 2010

2 commits

b9da05710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (31 commits)
driver core: Display error codes when class suspend fails
Driver core: Add section count to memory_block struct
Driver core: Add mutex for adding/removing memory blocks
Driver core: Move find_memory_block routine
hpilo: Despecificate driver from iLO generation
driver core: Convert link_mem_sections to use find_memory_block_hinted.
driver core: Introduce find_memory_block_hinted which utilizes kset_find_obj_hinted.
kobject: Introduce kset_find_obj_hinted.
driver core: fix build for CONFIG_BLOCK not enabled
driver-core: base: change to new flag variable
sysfs: only access bin file vm_ops with the active lock
sysfs: Fail bin file mmap if vma close is implemented.
FW_LOADER: fix kconfig dependency warning on HOTPLUG
uio: Statically allocate uio_class and use class .dev_attrs.
uio: Support 2^MINOR_BITS minors
uio: Cleanup irq handling.
uio: Don't clear driver data
uio: Fix lack of locking in init_uio_class
SYSFS: Allow boot time switching between deprecated and modern sysfs layout
driver core: remove CONFIG_SYSFS_DEPRECATED_V2 but keep it for block devices
...

Linus Torvalds
2010-10-23 10:36:42 +0800
e52eec13c SYSFS: Allow boot time switching between deprecated and modern sysfs layout

I have some systems which need legacy sysfs due to old tools that
assume a directory can never be a symlink to another directory, and
it's a big hassle to compile separate kernels for them.

This patch turns CONFIG_SYSFS_DEPRECATED into a run time option
that can be switched on/off from the kernel command line. This way
the same binary can be used in both cases with just an option
on the command line.

The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
the default. I kept the weird name to not break existing
config files.

Also the compat code can still be completely disabled by undefining
CONFIG_SYSFS_DEPRECATED_SWITCH -- the optimizer now takes
care of this instead of lots of ifdefs. This makes the code
look nicer.

v2: This is an updated version on top of Kay's patch to only
handle the block devices. I tested it on my old systems
and that seems to work.

Cc: axboe@kernel.dk
Signed-off-by: Andi Kleen
Cc: Kay Sievers
Signed-off-by: Greg Kroah-Hartman

Andi Kleen
2010-10-23 01:16:43 +0800

19 Oct, 2010

1 commit

7681bfeec block: fix accounting bug on cross partition merges

/proc/diskstats can display strange output, as follows.

$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                       ~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
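
The ninth statistics field of each line is the in-flight counter. A minimal Python sketch of how to read the sda1 value follows; it assumes the counter is a signed 32-bit quantity printed as unsigned, which the closeness of the value to 2^32 suggests. The helper name `in_flight_as_signed` is hypothetical.

```python
def in_flight_as_signed(line):
    """Reinterpret the in-flight field of a /proc/diskstats line as signed 32-bit."""
    fields = line.split()
    # fields[0:3] are major, minor and device name; the in-flight
    # counter is the 9th of the statistics fields that follow.
    raw = int(fields[3 + 8])
    return raw - (1 << 32) if raw >= (1 << 31) else raw


sda1 = "8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691"
sda2 = "8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390"

print(in_flight_as_signed(sda1), in_flight_as_signed(sda2))  # -232 232
```

Under that assumption sda1 has gone negative by exactly the amount that sda2 is stuck above zero, which is consistent with decrements landing on the wrong partition.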

The cause is incorrect accounting of hd_struct->in_flight when a bio is
merged by ELEVATOR_FRONT_MERGE into a request that belongs to a different
partition.

The detailed root cause is as follows.

Assume that there are two partitions, sda1 and sda2.

1. A request for sda2 is in the request_queue. Hence sda1's
hd_struct->in_flight is 0 and sda2's is 1.

     | hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------

2. A bio belonging to sda1 is issued and is merged into the request mentioned
in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
from the sda2 region to the sda1 region. However, neither partition's
hd_struct->in_flight is changed.

     | hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------

3. The request finishes and blk_account_io_done() is called. In this case,
sda1's hd_struct->in_flight, not sda2's, is decremented.

     | hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
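
The three steps and the fix can be reproduced with a small user-space model. This is a hedged Python sketch, not the kernel implementation: the `Partition`, `Request`, `lookup` and `run` names and the sector ranges are hypothetical, and only the idea of caching the partition in the request is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partition:
    name: str
    start: int          # first sector, inclusive
    end: int            # end sector, exclusive
    in_flight: int = 0


@dataclass
class Request:
    sector: int                        # first sector of the request
    part: Optional[Partition] = None   # cached partition (the fix)


def lookup(parts, sector):
    """Map a sector to the partition that contains it."""
    return next(p for p in parts if p.start <= sector < p.end)


def run(cache_partition):
    sda1, sda2 = Partition("sda1", 0, 100), Partition("sda2", 100, 200)
    parts = [sda1, sda2]

    # Step 1: a request starting in sda2 is queued and accounted.
    rq = Request(sector=100)
    part = lookup(parts, rq.sector)
    part.in_flight += 1
    if cache_partition:
        rq.part = part

    # Step 2: a bio in sda1 is front-merged, so the request now starts in sda1.
    rq.sector = 92

    # Step 3: completion uses the cached partition, or looks it up again.
    done = rq.part if cache_partition else lookup(parts, rq.sector)
    done.in_flight -= 1
    return sda1.in_flight, sda2.in_flight


print(run(cache_partition=False))  # (-1, 1): the imbalance from step 3
print(run(cache_partition=True))   # (0, 0): increment and decrement match
```

Caching also removes the second lookup, which is where the speedup mentioned above comes from in this model.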

When reloading partition tables, quiesce IO to ensure that no
request still holds a reference to the partition struct. When it is
safe to free the partition table, IO for that device is restarted.

Signed-off-by: Yasuaki Ishimatsu
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Yasuaki Ishimatsu
2010-10-19 15:07:02 +0800