Eric Lee / linux-smarc-t335x-v3.2

10 Sep, 2011

1 commit

94007751b Avoid dereferencing a 'request_queue' after last close.

On the last close of an 'md' device which has been stopped, the device
is destroyed and in particular the request_queue is freed. The free
is done in a separate thread so it might happen a short time later.

__blkdev_put calls bdev_inode_switch_bdi *after* ->release has been
called.

Since commit f758eeabeb96f878c860e8f110f94ec8820822a9
bdev_inode_switch_bdi will dereference the 'old' bdi, which lives
inside a request_queue, to get a spin lock. This causes the last
close on an md device to sometimes take a spin_lock which lives in
freed memory - which results in an oops.

So move the call to bdev_inode_switch_bdi before the call to
->release.

Cc: Christoph Hellwig
Cc: Hugh Dickins
Cc: Andrew Morton
Cc: Wu Fengguang
Acked-by: Wu Fengguang
Cc: stable@kernel.org
Signed-off-by: NeilBrown
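
The ordering problem described in this commit message can be modelled in userspace. The sketch below is not kernel code: every toy_* name is hypothetical, and an "alive" flag stands in for memory that the real ->release path may free. It only illustrates why the bdi switch has to happen while the old bdi, embedded in the request queue, is still valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real structures. */
struct toy_bdi {
    bool alive;     /* false once the owning queue has been torn down */
    int lock_uses;  /* how often the embedded "lock" was legitimately taken */
};

struct toy_queue {
    struct toy_bdi bdi;  /* the bdi lives inside the request queue */
};

struct toy_bdev {
    struct toy_bdi *bdi;      /* bdi the bdev inode is currently attached to */
    struct toy_queue *queue;  /* queue destroyed by the last release */
};

static struct toy_bdi toy_default_bdi = { .alive = true, .lock_uses = 0 };

/*
 * Models bdev_inode_switch_bdi(): it has to touch the old bdi's lock.
 * Returns false if the old bdi was already gone, which in the kernel
 * would be a use-after-free.
 */
static bool toy_switch_bdi(struct toy_bdev *bdev, struct toy_bdi *dst)
{
    assert(bdev != NULL && dst != NULL);
    struct toy_bdi *old = bdev->bdi;
    bool old_was_alive = old->alive;

    if (old_was_alive)
        old->lock_uses++;
    bdev->bdi = dst;
    return old_was_alive;
}

/* Models ->release() on the last close of a stopped md device. */
static void toy_release(struct toy_bdev *bdev)
{
    bdev->queue->bdi.alive = false;
}

/* Ordering before the fix: release first, then switch. */
static bool toy_put_buggy(struct toy_bdev *bdev)
{
    toy_release(bdev);
    return toy_switch_bdi(bdev, &toy_default_bdi);
}

/* Ordering after the fix: switch first, then release. */
static bool toy_put_fixed(struct toy_bdev *bdev)
{
    bool ok = toy_switch_bdi(bdev, &toy_default_bdi);

    toy_release(bdev);
    return ok;
}
```

In this model toy_put_fixed() always sees a live old bdi, while toy_put_buggy() reports that the switch touched a bdi whose queue was already released.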

NeilBrown
2011-09-10 15:20:21 +0800

02 Aug, 2011

1 commit

da5aa861b fix block device fallout from ->fsync() changes

blkdev_fsync() needs to write pages in pagecache...

Signed-off-by: Rafael J. Wysocki
Signed-off-by: Al Viro

Rafael J. Wysocki
2011-08-02 09:33:47 +0800

01 Aug, 2011

1 commit

782b94cdf block: initialise bd_super in bdget()

bd_super is currently reset to NULL in kill_block_super() so we rely on previous
users of the block_device object to initialise this value for the next user.
This quirk was exposed on RHEL5 when a third party filesystem did not always use
kill_block_super() and therefore bd_super wasn't being reset when a block_device
object was recycled within the cache. This may not be a problem upstream but
makes sense to be defensive.
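
A minimal userspace sketch of the defensive initialisation described above, assuming a tiny fixed-size object cache. The toy_* names and the cache layout are hypothetical and only illustrate the idea that the "get" path, not the previous user, resets the field.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_SLOTS 4

/* Hypothetical stand-in for a cached block_device object. */
struct toy_bdev {
    unsigned int devno;
    const char *super;  /* plays the role of bd_super */
};

static struct toy_bdev toy_cache[TOY_CACHE_SLOTS];

/*
 * Hand out the object in the given cache slot for a new user.
 * The object may be recycled, so reset 'super' here instead of
 * trusting the previous user to have cleared it on teardown.
 */
static struct toy_bdev *toy_bdget(size_t slot, unsigned int devno)
{
    assert(slot < TOY_CACHE_SLOTS);
    struct toy_bdev *bdev = &toy_cache[slot];

    bdev->devno = devno;
    bdev->super = NULL;  /* defensive initialisation */
    return bdev;
}
```

With this shape, a filesystem that forgets to clear the field on teardown cannot leak a stale pointer to the next user of the recycled object.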

Signed-off-by: Lachlan McIlroy
Reviewed-by: Eric Sandeen
Signed-off-by: Al Viro

Lachlan McIlroy
2011-08-01 13:57:44 +0800

27 Jul, 2011

1 commit

f01ef569c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
mm: properly reflect task dirty limits in dirty_exceeded logic
writeback: don't busy retry writeback on new/freeing inodes
writeback: scale IO chunk size up to half device bandwidth
writeback: trace global_dirty_state
writeback: introduce max-pause and pass-good dirty limits
writeback: introduce smoothed global dirty limit
writeback: consolidate variable names in balance_dirty_pages()
writeback: show bdi write bandwidth in debugfs
writeback: bdi write bandwidth estimation
writeback: account per-bdi accumulated written pages
writeback: make writeback_control.nr_to_write straight
writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
writeback: trace event writeback_queue_io
writeback: trace event writeback_single_inode
writeback: remove .nonblocking and .encountered_congestion
writeback: remove writeback_control.more_io
writeback: skip balance_dirty_pages() for in-memory fs
writeback: add bdi_dirty_limit() kernel-doc
writeback: avoid extra sync work at enqueue time
writeback: elevate queue_io() into wb_writeback()
...

Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

Linus Torvalds
2011-07-27 01:39:54 +0800

26 Jul, 2011

1 commit

096a705bb Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block

* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
block: strict rq_affinity
backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
block: fix patch import error in max_discard_sectors check
block: reorder request_queue to remove 64 bit alignment padding
CFQ: add think time check for group
CFQ: add think time check for service tree
CFQ: move think time check variables to a separate struct
fixlet: Remove fs_excl from struct task.
cfq: Remove special treatment for metadata rqs.
block: document blk_plug list access
block: avoid building too big plug list
compat_ioctl: fix make headers_check regression
block: eliminate potential for infinite loop in blkdev_issue_discard
compat_ioctl: fix warning caused by qemu
block: flush MEDIA_CHANGE from drivers on close(2)
blk-throttle: Make total_nr_queued unsigned
block: Add __attribute__((format(printf...) and fix fallout
fs/partitions/check.c: make local symbols static
block:remove some spare spaces in genhd.c
block:fix the comment error in blkdev.h
...

Linus Torvalds
2011-07-26 01:33:36 +0800

21 Jul, 2011

2 commits

02c24a821 fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara
Signed-off-by: Josef Bacik
Signed-off-by: Al Viro

Josef Bacik
2011-07-21 08:47:59 +0800
06222e491 fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek

This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the proper due diligence is done. For example
in NFS/CIFS we need to make sure the file size is updated properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Al Viro

Josef Bacik
2011-07-21 08:47:58 +0800

01 Jul, 2011

1 commit

85ef06d1d block: flush MEDIA_CHANGE from drivers on close(2)

Currently, only open(2) is defined as the 'clearing' point. It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events. Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.

The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage. Among other
things, GET_EVENT can't detect media removal while TUR can.

This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers. disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.
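
The mask handling described in the paragraph above can be sketched as follows. This is a simplified userspace model with hypothetical toy_* names; the real disk_flush_events() also deals with locking and rescheduling the event work, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event bits, modelled on the two events named in this log. */
enum {
    TOY_EVENT_MEDIA_CHANGE  = 1u << 0,
    TOY_EVENT_EJECT_REQUEST = 1u << 1,
};

/* Hypothetical per-disk event state. */
struct toy_disk_events {
    unsigned int clearing;  /* events to clear, handed to the driver as a hint */
};

/* Models the flush step: OR the requested mask into the pending hint. */
static void toy_flush_events(struct toy_disk_events *ev, unsigned int mask)
{
    assert(ev != NULL);
    ev->clearing |= mask;
}

/*
 * Models the next ->check_events() invocation: the accumulated hint is
 * consumed exactly once and returned as what the driver would receive.
 */
static unsigned int toy_take_clearing_hint(struct toy_disk_events *ev)
{
    assert(ev != NULL);
    unsigned int hint = ev->clearing;

    ev->clearing = 0;
    return hint;
}
```

Because the masks are OR'd together, several closes before the next check collapse into a single hint for the driver.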

This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).

Note: Given the current usage, it seems @clearing hint is needlessly
complex. disk_clear_events() can simply clear all events and the hint
can be boolean @flush.

Signed-off-by: Tejun Heo
Cc: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-07-01 22:17:47 +0800

13 Jun, 2011

1 commit

d4c208b86 block: use the passed in @bdev when claiming if partno is zero

6b4517a791 (block: implement bd_claiming and claiming block)
introduced claiming block to support O_EXCL blkdev opens properly.

bd_start_claiming() looks up the part 0 bdev and starts claiming
block. The function assumed that there is only one part 0 bdev and
always used bdget_disk(disk, 0) to look it up; unfortunately, this
isn't true for some drivers (floppy) which use multiple block devices
to denote different operating parameters for the same physical device.
There can be multiple part 0 bdev's for the same device number.

This incorrect assumption caused the wrong bdev to be used during
claiming leading to unbalanced bd_holders as reported in the following
bug.

https://bugzilla.kernel.org/show_bug.cgi?id=28522

This patch updates bd_start_claiming() such that it uses the bdev
specified as argument if its partno is zero.
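
A rough sketch of the selection rule this patch describes, using hypothetical toy_* types rather than the kernel's. The second parameter stands in for whatever bdget_disk(disk, 0) would have returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a block_device. */
struct toy_bdev {
    int partno;   /* 0 means this bdev denotes the whole device */
    int holders;  /* plays the role of bd_holders */
};

/*
 * Pick the bdev on which claiming is done. If the caller's bdev is
 * itself a part 0 bdev, use it directly; only partitions fall back to
 * the part 0 bdev looked up from the disk.
 */
static struct toy_bdev *toy_whole_for_claim(struct toy_bdev *bdev,
                                            struct toy_bdev *looked_up_part0)
{
    assert(bdev != NULL);
    if (bdev->partno == 0)
        return bdev;
    return looked_up_part0;
}
```

With two part 0 bdevs for the same device, as with floppy, each open is now claimed and released on the bdev that was actually opened, so the holder counts stay balanced.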

Note that this means that different bdev's can be used for the same
device and O_EXCL check can be effectively bypassed. It has always
been broken that way and floppy is fortunately on its way out. Leave
that breakage alone.

Signed-off-by: Tejun Heo
Reported-by: Alex Villacis Lasso
Tested-by: Alex Villacis Lasso
Cc: stable@kernel.org # >= v2.6.36
Signed-off-by: Jens Axboe

Tejun Heo
2011-06-13 18:45:48 +0800

08 Jun, 2011

1 commit

f758eeabe writeback: split inode_wb_list_lock into bdi_writeback.list_lock

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads. It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

vanilla 2.6.39-rc3:
      inode_wb_list_lock:          42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
      ------------------
      inode_wb_list_lock          2   [] bdev_inode_switch_bdi+0x29/0x85
      inode_wb_list_lock         34   [] inode_wb_list_del+0x22/0x49
      inode_wb_list_lock      12893   [] __mark_inode_dirty+0x170/0x1d0
      inode_wb_list_lock      10702   [] writeback_single_inode+0x16d/0x20a
      ------------------
      inode_wb_list_lock          2   [] bdev_inode_switch_bdi+0x29/0x85
      inode_wb_list_lock         19   [] inode_wb_list_del+0x22/0x49
      inode_wb_list_lock       5550   [] __mark_inode_dirty+0x170/0x1d0
      inode_wb_list_lock       8511   [] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
&(&wb->list_lock)->rlock:          11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
------------------------
&(&wb->list_lock)->rlock         10   [] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock       1493   [] writeback_inodes_wb+0x3d/0x150
&(&wb->list_lock)->rlock       3652   [] writeback_sb_inodes+0x123/0x16f
&(&wb->list_lock)->rlock       1412   [] writeback_single_inode+0x17f/0x223
------------------------
&(&wb->list_lock)->rlock          3   [] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock          6   [] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock       2061   [] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock       2629   [] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Wu Fengguang

Christoph Hellwig
2011-06-08 08:25:21 +0800

01 Jun, 2011

1 commit

4c49ff3fe block: blkdev_get() should access ->bd_disk only after success

d4dc210f69 (block: don't block events on excl write for non-optical
devices) added dereferencing of bdev->bd_disk to test
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
%NULL if open failed which can lead to an oops.

Test the flag after testing open was successful, not before.
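
The fix amounts to checking the open result before touching anything reachable through the bdev. A minimal userspace sketch, with hypothetical toy_* names and a made-up flag value standing in for GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit; the value is an assumption for this sketch. */
#define TOY_FL_BLOCK_EVENTS_ON_EXCL_WRITE 0x1u

struct toy_disk {
    unsigned int flags;
};

struct toy_bdev {
    struct toy_disk *disk;  /* stays NULL when the open failed */
};

/* Toy open: fails (non-zero) when there is no disk behind the bdev. */
static int toy_open(struct toy_bdev *bdev, struct toy_disk *disk)
{
    assert(bdev != NULL);
    bdev->disk = disk;
    return disk != NULL ? 0 : -1;
}

/*
 * Decide whether events should be blocked for this open. The open
 * result is tested first, so bdev->disk is never dereferenced after a
 * failed open.
 */
static bool toy_should_block_events(int open_result,
                                    const struct toy_bdev *bdev,
                                    bool excl_write)
{
    if (open_result != 0)
        return false;
    return excl_write &&
           (bdev->disk->flags & TOY_FL_BLOCK_EVENTS_ON_EXCL_WRITE) != 0;
}
```

Swapping the two checks in toy_should_block_events() would reintroduce the NULL dereference on the failure path.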

Signed-off-by: Tejun Heo
Reported-by: David Miller
Tested-by: David Miller
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Tejun Heo
2011-06-01 14:28:47 +0800

23 May, 2011

1 commit

7e69723fe block: move bd_set_size() above rescan_partitions() in __blkdev_get()

02e352287a4 (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check. As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.

sda: detected capacity change from 0 to 146815737856
sdb: detected capacity change from 0 to 146815737856

This patch restores the original order and removes the warning
messages.

stable: Please apply together with 02e352287a4 (block: rescan
partitions on invalidated devices on -ENOMEDIA too).

Signed-off-by: Tejun Heo
Reported-by: Tony Luck
Tested-by: Tony Luck
Cc: stable@kernel.org

Stable note: 2.6.39 only.
Signed-off-by: Jens Axboe

Tejun Heo
2011-05-23 19:26:07 +0800

22 Apr, 2011

2 commits

d4dc210f6 block: don't block events on excl write for non-optical devices

Disk event code automatically blocks events on excl write. This is
primarily to avoid issuing polling commands while burning is in
progress. This behavior doesn't fit other types of devices with
removable media where polling commands don't have adverse side
effects and door locking usually doesn't exist.

This patch introduces a new genhd flag which controls the auto-blocking
behavior and uses it to enable auto-blocking only on optical devices.

Note for stable: 2.6.38 and later only

Cc: stable@kernel.org
Signed-off-by: Tejun Heo
Reported-by: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-04-22 02:54:46 +0800
1196f8b81 block: rescan partitions on invalidated devices on -ENOMEDIA too

__blkdev_get() doesn't rescan partitions if disk->fops->open() fails,
which leads to ghost partition devices lingering after medium removal
is known to both the kernel and userland. The behavior also creates a
subtle inconsistency where O_NONBLOCK open, which doesn't fail even if
there's no medium, clears the ghost partitions, which is exploited to
work around the problem from userland.

Fix it by updating __blkdev_get() to issue partition rescan after
-ENOMEDIA too.

This was reported in the following bz.

https://bugzilla.kernel.org/show_bug.cgi?id=13029

Note for stable: 2.6.38 and later only

Cc: stable@kernel.org
Signed-off-by: Tejun Heo
Reported-by: David Zeuthen
Reported-by: Martin Pitt
Reported-by: Kay Sievers
Tested-by: Kay Sievers
Cc: Alan Cox
Signed-off-by: Jens Axboe

Tejun Heo
2011-04-22 02:54:45 +0800

31 Mar, 2011

1 commit

25985edce Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi

Lucas De Marchi
2011-03-31 22:26:23 +0800

25 Mar, 2011

3 commits

d39dd11c3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fs: simplify iget & friends
fs: pull inode->i_lock up out of writeback_single_inode
fs: rename inode_lock to inode_hash_lock
fs: move i_wb_list out from under inode_lock
fs: move i_sb_list out from under inode_lock
fs: remove inode_lock from iput_final and prune_icache
fs: Lock the inode LRU list separately
fs: factor inode disposal
fs: protect inode->i_state with inode->i_lock
autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
autofs4 - remove autofs4_lock
autofs4 - fix d_manage() return on rcu-walk
autofs4 - fix autofs4_expire_indirect() traversal
autofs4 - fix dentry leak in autofs4_expire_direct()
autofs4 - reinstate last used update on access
vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

Linus Torvalds
2011-03-25 10:01:30 +0800
a66979aba fs: move i_wb_list out from under inode_lock

Protect the inode writeback list with a new global lock
inode_wb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-03-25 09:17:51 +0800
250df6ed2 fs: protect inode->i_state with inode->i_lock

Protect inode state transitions and validity checks with the
inode->i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.

This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.

Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode->i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-03-25 09:16:31 +0800

19 Mar, 2011

1 commit

4345caba3 block: NULL dereference on error path in __blkdev_get()

"disk" is always NULL when we goto out. There was a check for this
before, but it was removed in 69e02c59a7d9 "block: Don't check events
while open is in progress".

Signed-off-by: Dan Carpenter
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

Dan Carpenter
2011-03-19 20:53:31 +0800

10 Mar, 2011

5 commits

4c63f5646 Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core

Conflicts:
block/blk-core.c
block/blk-flush.c
drivers/md/raid1.c
drivers/md/raid10.c
drivers/md/raid5.c
fs/nilfs2/btnode.c
fs/nilfs2/mdt.c

Signed-off-by: Jens Axboe

Jens Axboe
2011-03-10 15:58:35 +0800
7eaceacca block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe

Jens Axboe
2011-03-10 15:52:07 +0800
69e02c59a block: Don't check events while open is in progress

Not all block drivers clear events immediately after reporting. Some
do so in ->revalidate_disk() or other steps during ->open(). There is
a slim chance event poll may happen between the clearing event check
from check_disk_change() and the actual clearing of the events which
would result in spurious events.

Block event checks while block device open is in progress. There is
no need to kick explicit event check afterwards as events are always
checked during open.

-v2: The original patch could have called disk_unblock_events() with
an already released or %NULL @disk causing oops. Fixed by making
sure references are put after disk_unblock_events() is called.
It also makes the error path of __blkdev_get() a bit simpler.
This problem was reported by Jens.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Kay Sievers

Tejun Heo
2011-03-10 02:54:27 +0800
6936217cc block: Don't check events on close unless it was blocked

The block event mechanism currently always checks events when the
device is being closed regardless of the open mode. The intention was
to allow detection of EJECT_REQUEST when a device is closed whether
disk event polling is enabled or not.

This is unnecessary as, for devices of interest, events are checked
from either userland or kernel and in the former case ->check_events()
is performed on open of each poll attempt anyway. Furthermore, this
unconditional event check on close makes the code susceptible to event
loop if the block driver doesn't clear reported events correctly - an
event triggers userland to open and close the device which in turn
causes another event, rinse and repeat.

Check events on close only if it was blocked by excl write open.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Kay Sievers

Tejun Heo
2011-03-10 02:54:27 +0800
facc31ddc block: Don't implicitly trigger event check on disk_unblock_events()

Currently, disk_unblock_events() implicitly kicks an event check if the
block count reaches zero. This behavior is not described in the
comment and hinders future changes. Make the unblocker
explicitly check events by calling disk_check_events() as necessary.

This patch doesn't cause any behavior difference.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Kay Sievers

Tejun Heo
2011-03-10 02:54:27 +0800

01 Mar, 2011

1 commit

e6eb5ce1b fs/block_dev.c: fix new kernel-doc warning

Fix new kernel-doc warning in fs/block_dev.c:

Warning(fs/block_dev.c:937): No description found for parameter 'kill_dirty'

Signed-off-by: Randy Dunlap
Signed-off-by: Linus Torvalds

Randy Dunlap
2011-03-01 10:08:31 +0800

26 Feb, 2011

1 commit

638691a7a Merge branch 'for-linus' of git://neil.brown.name/md

* 'for-linus' of git://neil.brown.name/md:
md: Fix - again - partition detection when array becomes active
Fix over-zealous flush_disk when changing device size.
md: avoid spinlock problem in blk_throtl_exit
md: correctly handle probe of an 'mdp' device.
md: don't set_capacity before array is active.
md: Fix raid1->raid0 takeover

Linus Torvalds
2011-02-26 03:13:26 +0800

25 Feb, 2011

1 commit

e7407d161 block: bd_link_disk_holder() should hold on to holder_dir

The new implementation of bd_link_disk_holder() added by 49731baa41d
(block: restore multiple bd_link_disk_holder() support) didn't get an
extra reference for the holder_dir kobject of the slave bdev; however,
bdev kills holder_dir on removal, not release, so if the slave bdev is
removed while there are holder links, the holder_dir will be destroyed
while there still are holder links, which leads to oops later when
bd_unlink_disk_holder() tries to remove those links.

Make bd_link_disk_holder() grab an extra reference for the slave's
holder_dir and put it in bd_unlink_disk_holder().

Signed-off-by: Tejun Heo
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw"
Tested-by: "Hawrylewicz Czarnowski, Przemyslaw"
Cc: Neil Brown
Cc: Jens Axboe
Signed-off-by: Linus Torvalds

Tejun Heo
2011-02-25 00:55:55 +0800

24 Feb, 2011

1 commit

93b270f76 Fix over-zealous flush_disk when changing device size.

There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change) so any
data we hold becomes irrelevant.
In the other, the device has changed size (check_disk_size_change)
so data we hold may be irrelevant.

In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.

In the former case it makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data. In the
second case it *does*not* make sense to discard dirty buffers
as that will lead to file system corruption when you simply enlarge
the containing devices.

flush_disk calls __invalidate_device.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.

invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.

invalidate_bdev *does*not* discard dirty pages, but I don't really care
about that at present.

So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.

flush_disk then passes true from check_disk_change and false from
check_disk_size_change.
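
The flag plumbing described above can be modelled in a few lines. This is a userspace sketch with hypothetical toy_* names, where an array of toy inodes stands in for the inodes of the device; it is not the kernel's invalidate_inodes().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached inode. */
struct toy_inode {
    bool dirty;   /* holds data not yet written to the device */
    bool cached;  /* still present in the cache */
};

/*
 * Drop cached inodes. Dirty inodes are dropped only when kill_dirty is
 * set. Returns how many dirty inodes were left in place.
 */
static size_t toy_invalidate_inodes(struct toy_inode *inodes, size_t n,
                                    bool kill_dirty)
{
    size_t kept = 0;

    assert(inodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (inodes[i].dirty && !kill_dirty) {
            kept++;
            continue;
        }
        inodes[i].cached = false;
        inodes[i].dirty = false;
    }
    return kept;
}

/* Device disappeared: nothing can be written back, discard everything. */
static size_t toy_check_disk_change(struct toy_inode *inodes, size_t n)
{
    return toy_invalidate_inodes(inodes, n, true);
}

/* Device only changed size: dirty data is still valid, keep it. */
static size_t toy_check_disk_size_change(struct toy_inode *inodes, size_t n)
{
    return toy_invalidate_inodes(inodes, n, false);
}
```

The two callers differ only in the boolean they pass, which mirrors the true/false split between check_disk_change and check_disk_size_change in the commit message.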

dm avoids tripping over this problem by calling i_size_write directly
rather than using check_disk_size_change.

md does use check_disk_size_change and so is affected.

This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.

Cc: stable@kernel.org
Acked-by: Jeff Moyer
Cc: Andrew Patterson
Cc: Jens Axboe
Signed-off-by: NeilBrown

NeilBrown
2011-02-24 14:25:47 +0800

17 Feb, 2011

1 commit

e51900f7d block: revert block_dev read-only check

This reverts commit 75f1dc0d076d ("block: check bdev_read_only() from
blkdev_get()"). That commit added stricter checking to make sure
devices that were being used read-only were actually opened in that
mode.

It turns out that the change breaks a bunch of kernel code that opens
block devices. Affected systems include dm, md, and the loop device.
Because strict checking for read-only opens of block devices was not
done before this, the code that opens the devices was opening them
read-write even if they were being used read-only. Auditing all that
code will take time, and new userspace packages for dm, mdadm, etc.
will also be required.

Signed-off-by: Chuck Ebbert
Signed-off-by: Linus Torvalds

Chuck Ebbert
2011-02-17 08:48:13 +0800

15 Jan, 2011

1 commit

49731baa4 block: restore multiple bd_link_disk_holder() support

Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum. dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.

Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking. The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.
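
Per-link reference counting of the kind described above might look roughly like this in a userspace model. The toy_* names, the fixed-size table and the opaque disk pointer are assumptions made for the sketch; the kernel keeps a list of holder entries and also manages the sysfs symlinks, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HOLDERS 8

/* One entry per distinct holder disk, with its own reference count. */
struct toy_holder {
    const void *disk;
    int refcnt;
};

/* Hypothetical stand-in for the slave block device. */
struct toy_bdev {
    struct toy_holder holders[TOY_MAX_HOLDERS];
    size_t nr_holders;
};

static struct toy_holder *toy_find_holder(struct toy_bdev *bdev,
                                          const void *disk)
{
    for (size_t i = 0; i < bdev->nr_holders; i++)
        if (bdev->holders[i].disk == disk)
            return &bdev->holders[i];
    return NULL;
}

/* Link: bump the count of an existing link, or create a new one. */
static int toy_link_holder(struct toy_bdev *bdev, const void *disk)
{
    struct toy_holder *h = toy_find_holder(bdev, disk);

    if (h != NULL) {
        h->refcnt++;
        return 0;
    }
    if (bdev->nr_holders == TOY_MAX_HOLDERS)
        return -1;  /* table full in this toy model */
    h = &bdev->holders[bdev->nr_holders++];
    h->disk = disk;
    h->refcnt = 1;
    return 0;
}

/* Unlink: the link goes away only when its last reference is dropped. */
static void toy_unlink_holder(struct toy_bdev *bdev, const void *disk)
{
    struct toy_holder *h = toy_find_holder(bdev, disk);

    if (h == NULL)
        return;
    assert(h->refcnt > 0);
    if (--h->refcnt == 0)
        *h = bdev->holders[--bdev->nr_holders];  /* move last entry into the gap */
}
```

Linking the same holder twice leaves a single entry with a count of two, so one unlink does not remove a link that another user still relies on.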

While at it, note that this facility should not be used by anyone else
than the current ones. Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.

Signed-off-by: Tejun Heo
Reported-by: Milan Broz
Cc: Jun'ichi Nomura
Cc: Neil Brown
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-01-15 01:44:22 +0800

14 Jan, 2011

1 commit

275220f0f Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block

* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...

Linus Torvalds
2011-01-14 02:45:01 +0800

13 Jan, 2011

1 commit

c74a1cbb3 pass default dentry_operations to mount_pseudo()

Signed-off-by: Al Viro

Al Viro
2011-01-13 09:03:43 +0800

07 Jan, 2011

1 commit

fa0d7e3de fs: icache RCU free inodes

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.

The downside of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (i.e. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:26 +0800

17 Dec, 2010

1 commit

77ea887e4 implement in-kernel gendisk events handling

Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.

* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues a
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such command
sequences.

* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.

* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).

This patch implements a framework for in-kernel disk event handling,
which includes media presence polling.

* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there are any pending events and return them if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called in parallel.

* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.

* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.

* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.

* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.

* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.

* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
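
The interval rules in the list above can be sketched as a small userspace
model. This is only an illustration of the described behaviour, not the
kernel code: struct disk_model, effective_poll_msecs() and the numeric
values of the event bits are hypothetical names and values chosen for the
example.

```c
#include <assert.h>
#include <stddef.h>

/* Event names come from the commit message; the bit values are illustrative. */
#define DISK_EVENT_MEDIA_CHANGE  (1u << 0)
#define DISK_EVENT_EJECT_REQUEST (1u << 1)

/* Hypothetical stand-in for the per-disk state described above. */
struct disk_model {
	unsigned int events;       /* mask of all supported events */
	unsigned int async_events; /* events reported without polling */
	long poll_msecs;           /* per-device setting, -1 = use system setting */
};

/*
 * Return the effective polling interval in milliseconds; 0 means
 * "do not poll". dfl_poll_msecs models block.events_dfl_poll_msecs.
 */
long effective_poll_msecs(const struct disk_model *disk, long dfl_poll_msecs)
{
	assert(disk != NULL);

	/* An explicitly set per-device interval always wins. */
	if (disk->poll_msecs >= 0)
		return disk->poll_msecs;

	/* Use the system interval only if some event actually needs polling. */
	if (disk->events & ~disk->async_events)
		return dfl_poll_msecs;

	/* Every supported event is reported asynchronously: no polling. */
	return 0;
}
```

With a system interval of 2000 ms, a device whose events are all
asynchronous and whose own interval is left at -1 is not polled, while
writing 500 to its events_poll_msecs would make it poll every 500 ms.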

Signed-off-by: Tejun Heo
Cc: Kay Sievers
Cc: Jan Kara
Signed-off-by: Jens Axboe

Tejun Heo
2010-12-17 00:53:38 +0800

18 Nov, 2010

1 commit

451a3c24b BKL: remove extraneous #include <smp_lock.h>

The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann
Signed-off-by: Linus Torvalds

Arnd Bergmann
2010-11-18 00:59:32 +0800

13 Nov, 2010

5 commits

d4d776299 block: clean up blkdev_get() wrappers and their users

After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.

* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.

The new blkdev_get_*() functions come with proper docbook comments.
While at it, add a function description to blkdev_get() too.

Signed-off-by: Tejun Heo
Cc: Philipp Reisner
Cc: Neil Brown
Cc: Mike Snitzer
Cc: Joern Engel
Cc: Chris Mason
Cc: Jan Kara
Cc: "Theodore Ts'o"
Cc: KONISHI Ryusuke
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro

Tejun Heo
2010-11-13 18:55:18 +0800

75f1dc0d0 block: check bdev_read_only() from blkdev_get()

bdev read-only status can be queried using bdev_read_only() and may
change while the device is being opened. Enforce it by checking it
from blkdev_get() after open succeeds.

This makes bdev_read_only() check in open_bdev_exclusive() and
fsg_lun_open() unnecessary. Drop them.
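
A minimal userspace sketch of the "check after open succeeds" ordering
described above. struct ro_bdev_model, model_open_checked() and
MODEL_FMODE_WRITE are hypothetical names, and the -EACCES return value is an
assumption made for the example; the message itself does not state which
error is returned.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_FMODE_WRITE 0x2u /* illustrative value, not the kernel's */

/* Hypothetical stand-in for a block device. */
struct ro_bdev_model {
	bool read_only; /* what bdev_read_only() would report */
	int openers;    /* number of successful opens still held */
};

/*
 * Open first, then enforce the read-only status. Checking after the
 * open covers a status that changed while the device was being opened.
 */
int model_open_checked(struct ro_bdev_model *bdev, unsigned int mode)
{
	bdev->openers++; /* models the open itself succeeding */

	if ((mode & MODEL_FMODE_WRITE) && bdev->read_only) {
		bdev->openers--; /* undo the open before failing */
		return -EACCES;  /* assumed error code */
	}
	return 0;
}
```

Because the check lives in the common open path, individual callers no
longer need their own read-only test.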

Signed-off-by: Tejun Heo
Cc: David Brownell
Cc: linux-usb@vger.kernel.org

Tejun Heo
2010-11-13 18:55:17 +0800

6a027eff6 block: reorganize claim/release implementation

With claim/release rolled into blkdev_get/put(), there's no reason to
keep bd_abort/finish_claim(), __bd_claim() and bd_release() as
separate functions. It only makes the code difficult to follow.
Collapse them into blkdev_get/put(). This will ease future changes
around claim/release.

Signed-off-by: Tejun Heo

Tejun Heo
2010-11-13 18:55:17 +0800

e525fd89d block: make blkdev_get/put() handle exclusive access

Over time, the block layer has accumulated a set of APIs dealing with
bdev open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are a combination of open and claim and
the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open is for another
exclusive access. Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if @FMODE_EXCL is specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().

Most bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at open time (as
they should be). This cleans up the code and removes some corner cases
which, valid or not, were unnecessary all the same.
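
The claim-at-open behaviour described above can be modelled in userspace
roughly as follows. This is a sketch of the semantics only: struct
bdev_model, model_blkdev_get(), model_blkdev_put() and MODEL_FMODE_EXCL are
hypothetical names, returning -EINVAL for a missing holder is a choice made
for the model, and the real code also has to consider details (such as
partitions versus whole disks) that are left out here.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_FMODE_EXCL 0x80u /* illustrative value, not the kernel's */

/* Hypothetical stand-in for a block device. */
struct bdev_model {
	void *holder; /* current exclusive holder, NULL if none */
	int holders;  /* exclusive opens held by that holder */
	int openers;  /* all opens, exclusive or not */
};

/* Open and, if requested, claim in one step so the two cannot be split. */
int model_blkdev_get(struct bdev_model *bdev, unsigned int mode, void *holder)
{
	if (mode & MODEL_FMODE_EXCL) {
		if (holder == NULL)
			return -EINVAL; /* model choice: exclusive open needs a holder */
		if (bdev->holder != NULL && bdev->holder != holder)
			return -EBUSY;  /* someone else already holds the device */
		bdev->holder = holder;
		bdev->holders++;
	}
	bdev->openers++;
	return 0;
}

/* Close; the caller passes the same mode so the claim is dropped too. */
void model_blkdev_put(struct bdev_model *bdev, unsigned int mode)
{
	if ((mode & MODEL_FMODE_EXCL) && --bdev->holders == 0)
		bdev->holder = NULL; /* last exclusive claim released */
	bdev->openers--;
}
```

In this model a second exclusive open by the same holder succeeds, one by a
different holder fails with -EBUSY, and non-exclusive opens are unaffected.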

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.

Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo
Acked-by: Neil Brown
Acked-by: Ryusuke Konishi
Acked-by: Mike Snitzer
Acked-by: Philipp Reisner
Cc: Peter Osterlund
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Jan Kara
Cc: Andrew Morton
Cc: Andreas Dilger
Cc: "Theodore Ts'o"
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Alex Elder
Cc: Christoph Hellwig
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen
Cc: Scott Branden
Cc: Chris Mason
Cc: Steven Whitehouse
Cc: Dave Kleikamp
Cc: Joern Engel
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro

Tejun Heo
2010-11-13 18:55:17 +0800

e09b457bd block: simplify holder symlink handling

Code to manage symlinks in /sys/block/*/{holders|slaves} is overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.

Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface. This is a step for cleaning up
bd_claim/release. This patch makes dm-table slightly more complex but
it will be simplified again with further changes.

Signed-off-by: Tejun Heo
Acked-by: Neil Brown
Acked-by: Mike Snitzer
Cc: dm-devel@redhat.com

Tejun Heo
2010-11-13 18:55:17 +0800