Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

18 Dec, 2012

1 commit

965c8e59c lseek: the "whence" argument is called "whence"

But the kernel decided to call it "origin" instead. Fix most of the
sites.

Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2012-12-18 09:15:12 +0800
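
As a small illustration of what the "whence" argument selects, the sketch below resolves an lseek-style request against a file of known size. The helper name resolve_seek and the plain -1 error return are assumptions made for this example; the kernel's llseek handlers return negative errno values and cover more cases.

```c
#include <assert.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * Hypothetical helper (not a kernel function): compute the new file
 * position for an lseek-style request. Returns -1 for an unknown
 * whence value or a negative resulting position.
 */
long long resolve_seek(long long pos, long long size,
                       long long offset, int whence)
{
    long long base;

    assert(size >= 0 && pos >= 0);      /* precondition of this model */

    switch (whence) {
    case SEEK_SET: base = 0;    break;  /* relative to start of file */
    case SEEK_CUR: base = pos;  break;  /* relative to current position */
    case SEEK_END: base = size; break;  /* relative to end of file */
    default:       return -1;           /* unsupported whence value */
    }

    return (base + offset < 0) ? -1 : base + offset;
}
```

For a 100-byte file with the position at 10, resolve_seek(10, 100, -20, SEEK_END) yields 80, while the same offset with SEEK_SET is rejected.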

09 Dec, 2012

1 commit

684c9aaeb vfs: fix O_DIRECT read past end of block device

The direct-IO write path already had the i_size checks in mm/filemap.c,
but it turns out the read path did not, and removing the block size
checks in fs/block_dev.c (commit bbec0270bdd8: "blkdev_max_block: make
private to fs/buffer.c") removed the magic "shrink IO to past the end of
the device" code there.

Fix it by truncating the IO to the size of the block device, like the
write path already does.

NOTE! I suspect the write path would be *much* better off doing it this
way in fs/block_dev.c, rather than hidden deep in mm/filemap.c. The
mm/filemap.c code is extremely hard to follow, and has various
conditionals on the target being a block device (ie the flag passed in
to 'generic_write_checks()', along with a conditional update of the
inode timestamp etc).

It is also quite possible that we should treat this whole block device
size as a "s_maxbytes" issue, and try to make the logic even more
generic. However, in the meantime this is the fairly minimal targeted
fix.

Noted by Milan Broz thanks to a regression test for the cryptsetup
reencrypt tool.

Reported-and-tested-by: Milan Broz
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-12-09 00:28:26 +0800
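
The idea of truncating the IO to the size of the block device can be sketched in user space as follows. This is only a model under stated assumptions: the helper name clamp_read_len and its parameters are hypothetical, and the actual fix operates on the kernel's own structures in fs/block_dev.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's code): limit a read of `len`
 * bytes starting at byte offset `pos` so that it never extends past
 * the end of a device that is `dev_size` bytes long.
 */
size_t clamp_read_len(uint64_t pos, size_t len, uint64_t dev_size)
{
    if (pos >= dev_size)
        return 0;                       /* nothing left to read */

    uint64_t remaining = dev_size - pos;
    return (len < remaining) ? len : (size_t)remaining;
}
```

A 4096-byte read at offset 6144 of an 8192-byte device is shortened to 2048 bytes, and a read starting at or beyond the end returns nothing.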

30 Nov, 2012

2 commits

bbec0270b blkdev_max_block: make private to fs/buffer.c

We really don't want to look at the block size for the raw block device
accesses in fs/block_dev.c, because it may be changing from under us.
So get rid of the max_block logic entirely, since the caller should
already have done it anyway.

That leaves the only user of this function in fs/buffer.c, so move the
whole function there and make it static.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-11-30 09:48:12 +0800

1e8b33328 blockdev: remove bd_block_size_semaphore again

This reverts the block-device direct access code to the previous
unlocked code, now that fs/buffer.c no longer needs external locking.

With this, fs/block_dev.c is back to the original version, apart from a
whitespace cleanup that I didn't want to revert.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-11-30 02:52:19 +0800

29 Oct, 2012

1 commit

1a25b1c4c Lock splice_read and splice_write functions

Functions generic_file_splice_read and generic_file_splice_write access
the pagecache directly. For block devices these functions must be locked
so that block size is not changed while they are in progress.

This patch is an additional fix for commit b87570f5d349 ("Fix a crash
when block device is read and block size is changed at the same time")
that locked aio_read, aio_write and mmap against block size change.

Signed-off-by: Mikulas Patocka
Signed-off-by: Linus Torvalds

Mikulas Patocka
2012-10-29 01:59:37 +0800

11 Oct, 2012

1 commit

ce40be7a8 Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block

Pull block IO update from Jens Axboe:
"Core block IO bits for 3.7. Not a huge round this time, it contains:

- First series from Kent cleaning up and generalizing bio allocation
and freeing.

- WRITE_SAME support from Martin.

- Mikulas patches to prevent O_DIRECT crashes when someone changes
the block size of a device.

- Make bio_split() work on data-less bio's (like trim/discards).

- A few other minor fixups."

Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
Morton. It is due to the VM no longer using a prio-tree (see commit
6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

So make set_blocksize() use mapping_mapped() instead of open-coding the
internal VM knowledge that has changed.

* 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
block: makes bio_split support bio without data
scatterlist: refactor the sg_nents
scatterlist: add sg_nents
fs: fix include/percpu-rwsem.h export error
percpu-rw-semaphore: fix documentation typos
fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
blockdev: turn a rw semaphore into a percpu rw semaphore
Fix a crash when block device is read and block size is changed at the same time
block: fix request_queue->flags initialization
block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
block: ioctl to zero block ranges
block: Make blkdev_issue_zeroout use WRITE SAME
block: Implement support for WRITE SAME
block: Consolidate command flag and queue limit checks for merges
block: Clean up special command handling logic
block/blk-tag.c: Remove useless kfree
block: remove the duplicated setting for congestion_threshold
block: reject invalid queue attribute values
block: Add bio_clone_bioset(), bio_clone_kmalloc()
block: Consolidate bio_alloc_bioset(), bio_kmalloc()
...

Linus Torvalds
2012-10-11 08:04:23 +0800

26 Sep, 2012

3 commits

3eab7315c fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared

blkdev_mmap() isn't used outside of fs/block_dev.c, mark it as
static.

Reported-by: Fengguang Wu
Signed-off-by: Jens Axboe

Fengguang Wu
2012-09-26 15:57:55 +0800

62ac665ff blockdev: turn a rw semaphore into a percpu rw semaphore

This avoids cache line bouncing when many processes lock the semaphore
for read.

New percpu lock implementation

The lock consists of an array of percpu unsigned integers, a boolean
variable and a mutex.

When we take the lock for read, we enter an RCU read section and check
the "locked" variable. If it is false, we increase a percpu counter on
the current CPU and exit the RCU section. If "locked" is true, we exit
the RCU section, take the mutex and drop it (this waits until the writer
has finished) and retry.

Unlocking for read just decreases the percpu variable. Note that we can
unlock on a different CPU than the one where we locked; in this case the
counter underflows. The sum of all percpu counters represents the number
of processes that hold the lock for read.

When we need to lock for write, we take the mutex, set the "locked"
variable to true and synchronize RCU. Since RCU has been synchronized, no
processes can create new read locks. We wait until the sum of percpu
counters is zero - when it is, there are no readers in the critical
section.

Signed-off-by: Mikulas Patocka
Signed-off-by: Jens Axboe

Mikulas Patocka
2012-09-26 13:46:43 +0800
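
The counter bookkeeping described above can be modelled in plain C. This is a single-threaded sketch only: it leaves out RCU and the mutex entirely, the CPU is passed in as an explicit argument, and all names (percpu_rw_model, model_read_trylock and so on) are hypothetical rather than the kernel's API. Its purpose is to show why an unlock on a different CPU is harmless: the per-CPU counters are unsigned, so the sum still comes out right.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4   /* hypothetical CPU count for this model */

struct percpu_rw_model {
    unsigned int counters[NR_CPUS_MODEL]; /* one reader counter per CPU */
    bool locked;                          /* set while a writer wants the lock */
};

/* Reader fast path: refuse if a writer has set "locked" (a real reader
 * would wait on the mutex and retry); otherwise count on this CPU. */
bool model_read_trylock(struct percpu_rw_model *s, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    if (s->locked)
        return false;
    s->counters[cpu]++;
    return true;
}

/* Read unlock: decrement on whichever CPU we run on now. Unsigned
 * arithmetic wraps, so the "underflow" is well defined. */
void model_read_unlock(struct percpu_rw_model *s, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    s->counters[cpu]--;
}

/* Number of readers: the sum of all per-CPU counters (modulo 2^N). */
unsigned int model_readers(const struct percpu_rw_model *s)
{
    unsigned int sum = 0;
    for (int i = 0; i < NR_CPUS_MODEL; i++)
        sum += s->counters[i];
    return sum;
}

/* Writer: block new readers, then report whether all readers are gone.
 * The real lock waits at this point instead of returning false. */
bool model_write_begin(struct percpu_rw_model *s)
{
    s->locked = true;
    return model_readers(s) == 0;
}
```

If a reader locks on CPU 0 and unlocks on CPU 1, counters[0] is 1 and counters[1] wraps to the largest unsigned value, yet model_readers() returns 0, so a writer may proceed.
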

b87570f5d Fix a crash when block device is read and block size is changed at the same time

The kernel may crash when block size is changed and I/O is issued
simultaneously.

Because some subsystems (udev or lvm) may read any block device anytime,
the bug actually puts any code that changes a block device size in
jeopardy.

The crash can be reproduced by placing "msleep(1000)" in
blkdev_get_blocks just before "bh->b_size = max_blocks <<
inode->i_blkbits;".
Then run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct".
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0".
You get a BUG.

The direct and non-direct I/O code is written with the assumption that
the block size does not change. It doesn't seem practical to fix these
crashes one by one: there may be many crash possibilities when the block
size changes at a certain place, and it is impossible to find them all
and verify the code.

This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
taken for read during I/O. It is taken for write when changing block
size. Consequently, block size can't be changed while I/O is being
submitted.

For asynchronous I/O, the patch only prevents block size change while
the I/O is being submitted. The block size can change when the I/O is in
progress or when the I/O is being finished. This is acceptable because
there are no accesses to block size when asynchronous I/O is being
finished.

The patch prevents block size changing while the device is mapped with
mmap.

Signed-off-by: Mikulas Patocka
Signed-off-by: Jens Axboe

Mikulas Patocka
2012-09-26 13:46:40 +0800
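
The arithmetic around the statement quoted in the reproduction recipe shows why a block size change in that window is harmful. The sketch assumes that max_blocks was derived from the request size using the old block size before the sleep; the helper names are hypothetical and this is not the kernel's code.

```c
#include <assert.h>

/* Whole blocks contained in `bytes` when a block is 1 << blkbits bytes. */
unsigned long bytes_to_blocks(unsigned long bytes, unsigned int blkbits)
{
    assert(blkbits < 32);               /* keep the shifts well defined */
    return bytes >> blkbits;
}

/* Mirrors the quoted statement: b_size = max_blocks << i_blkbits. */
unsigned long blocks_to_bytes(unsigned long max_blocks, unsigned int blkbits)
{
    assert(blkbits < 32);
    return max_blocks << blkbits;
}
```

A 4096-byte request with 4096-byte blocks (blkbits 12) is one block. If the block size is switched to 2048 (blkbits 11) before the shift back, blocks_to_bytes(1, 11) gives 2048, so the computed size no longer matches the request that was made.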

02 Aug, 2012

1 commit

53362a05a fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices

For regular files, the write operation used blk_plug, but for block
device files it did not.
This patch also applies to write-cache mode for block devices.

Signed-off-by: Jianpeng Ma
Reviewed-by: NeilBrown
Signed-off-by: Jens Axboe

Jianpeng Ma
2012-08-02 15:50:39 +0800

23 Jul, 2012

1 commit

5c0d6b60a vfs: Create function for iterating over block devices

Signed-off-by: Jan Kara
Signed-off-by: Al Viro

Jan Kara
2012-07-23 03:58:45 +0800

29 May, 2012

1 commit

90324cc1b Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback tree from Wu Fengguang:
"Mainly from Jan Kara to avoid iput() in the flusher threads."

* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Avoid iput() from flusher thread
vfs: Rename end_writeback() to clear_inode()
vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
writeback: Refactor writeback_single_inode()
writeback: Remove wb->list_lock from writeback_single_inode()
writeback: Separate inode requeueing after writeback
writeback: Move I_DIRTY_PAGES handling
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
writeback: Move clearing of I_SYNC into inode_sync_complete()
writeback: initialize global_dirty_limit
fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
mm: page-writeback.c: local functions should not be exposed globally

Linus Torvalds
2012-05-29 00:54:45 +0800

11 May, 2012

1 commit

080399aaa block: don't mark buffers beyond end of disk as mapped

Hi,

We have a bug report open where a squashfs image mounted on ppc64 would
exhibit errors due to trying to read beyond the end of the disk. It can
easily be reproduced by doing the following:

[root@ibm-p750e-02-lp3 ~]# ls -l install.img
-rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
[root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
[root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
dd: reading `/dev/loop0': Input/output error
277376+0 records in
277376+0 records out
142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s

In dmesg, you'll find the following:

squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 43.106012] attempt to access beyond end of device
[ 43.106029] loop0: rw=0, want=277410, limit=277408
[ 43.106039] Buffer I/O error on device loop0, logical block 138704
[ 43.106053] attempt to access beyond end of device
[ 43.106057] loop0: rw=0, want=277412, limit=277408
[ 43.106061] Buffer I/O error on device loop0, logical block 138705
[ 43.106066] attempt to access beyond end of device
[ 43.106070] loop0: rw=0, want=277414, limit=277408
[ 43.106073] Buffer I/O error on device loop0, logical block 138706
[ 43.106078] attempt to access beyond end of device
[ 43.106081] loop0: rw=0, want=277416, limit=277408
[ 43.106085] Buffer I/O error on device loop0, logical block 138707
[ 43.106089] attempt to access beyond end of device
[ 43.106093] loop0: rw=0, want=277418, limit=277408
[ 43.106096] Buffer I/O error on device loop0, logical block 138708
[ 43.106101] attempt to access beyond end of device
[ 43.106104] loop0: rw=0, want=277420, limit=277408
[ 43.106108] Buffer I/O error on device loop0, logical block 138709
[ 43.106112] attempt to access beyond end of device
[ 43.106116] loop0: rw=0, want=277422, limit=277408
[ 43.106120] Buffer I/O error on device loop0, logical block 138710
[ 43.106124] attempt to access beyond end of device
[ 43.106128] loop0: rw=0, want=277424, limit=277408
[ 43.106131] Buffer I/O error on device loop0, logical block 138711
[ 43.106135] attempt to access beyond end of device
[ 43.106139] loop0: rw=0, want=277426, limit=277408
[ 43.106143] Buffer I/O error on device loop0, logical block 138712
[ 43.106147] attempt to access beyond end of device
[ 43.106151] loop0: rw=0, want=277428, limit=277408
[ 43.106154] Buffer I/O error on device loop0, logical block 138713
[ 43.106158] attempt to access beyond end of device
[ 43.106162] loop0: rw=0, want=277430, limit=277408
[ 43.106166] attempt to access beyond end of device
[ 43.106169] loop0: rw=0, want=277432, limit=277408
...
[ 43.106307] attempt to access beyond end of device
[ 43.106311] loop0: rw=0, want=277470, limit=2774

Squashfs manages to read in the end block(s) of the disk during the
mount operation. Then, when dd reads the block device, it leads to
block_read_full_page being called with buffers that are beyond end of
disk, but are marked as mapped. Thus, it would end up submitting read
I/O against them, resulting in the errors mentioned above. I fixed the
problem by modifying init_page_buffers to only set the buffer mapped if
it fell inside of i_size.

Cheers,
Jeff

Signed-off-by: Jeff Moyer
Acked-by: Nick Piggin

--

Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
Signed-off-by: Jens Axboe

Jeff Moyer
2012-05-11 22:42:14 +0800
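
The check that decides whether a buffer may be marked mapped can be sketched with the numbers from the report. The helper names are hypothetical, and the 1024-byte block size is an assumption inferred from the dmesg output (a limit of 277408 512-byte sectors and a first failing logical block of 138704); this is a model, not the code in init_page_buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First block number past the end of a device of `size_bytes` bytes,
 * for blocks of 1 << blkbits bytes (hypothetical helper). */
uint64_t end_block(uint64_t size_bytes, unsigned int blkbits)
{
    assert(blkbits < 64);               /* keep the shift well defined */
    return size_bytes >> blkbits;
}

/* A buffer may be marked mapped only if its block lies inside the device. */
bool may_map_block(uint64_t block, uint64_t size_bytes, unsigned int blkbits)
{
    return block < end_block(size_bytes, blkbits);
}
```

For the 142032896-byte install.img with 1024-byte blocks, end_block() is 138704: block 138703 is the last one inside the device, and block 138704, the first one named in the dmesg output, must not be marked mapped.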

06 May, 2012

1 commit

dbd5768f8 vfs: Rename end_writeback() to clear_inode()

After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode(),
which describes well what the function really does: it sets the I_CLEAR flag.

Signed-off-by: Jan Kara
Signed-off-by: Fengguang Wu

Jan Kara
2012-05-06 13:43:41 +0800

24 Mar, 2012

1 commit

b502bd115 magic.h: move some FS magic numbers into magic.h

- Move open-coded filesystem magic numbers into magic.h

- Rearrange magic.h so that the filesystem-related constants are grouped
together.

Signed-off-by: Muthukumar R
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Muthu Kumar
2012-03-24 07:58:31 +0800

20 Mar, 2012

1 commit

16c0cfa42 Merge branch 'stable/cleancache.v13' into linux-next

* stable/cleancache.v13:
mm: cleancache: Use __read_mostly as appropiate.
mm: cleancache: report statistics via debugfs instead of sysfs.
mm: zcache/tmem/cleancache: s/flush/invalidate/
mm: cleancache: s/flush/invalidate/

Konrad Rzeszutek Wilk
2012-03-20 00:12:19 +0800

02 Mar, 2012

1 commit

fe316bf2d block: Fix NULL pointer dereference in sd_revalidate_disk

Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise KOBJ_CHANGE uevent.

However, it ends up calling the driver's revalidate_disk without the
device being open, which can cause an oops.

In the case of SCSI:

process A                 process B
----------------------------------------------
sys_open
  __blkdev_get
    sd_open
      returns -ENOMEDIUM
                          scsi_remove_device
    rescan_partitions
      sd_revalidate_disk

Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052

This patch separates the partition invalidation from rescan_partitions()
and uses it for the -ENOMEDIUM case.

Reported-by: Huajun Li
Signed-off-by: Jun'ichi Nomura
Acked-by: Tejun Heo
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Jun'ichi Nomura
2012-03-02 17:38:33 +0800

24 Jan, 2012

1 commit

3167760f8 mm: cleancache: s/flush/invalidate/

Per akpm suggestions alter the use of the term flush to be
invalidate. The next patch will do this across all MM.

This change is completely cosmetic.

[v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]

Signed-off-by: Dan Magenheimer
Cc: Kamezawa Hiroyuki
Cc: Jan Beulich
Reviewed-by: Seth Jennings
Cc: Jeremy Fitzhardinge
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Nitin Gupta
Cc: Matthew Wilcox
Cc: Chris Mason
Cc: Rik Riel
Cc: Andrew Morton
[v10: Fixed fs: move code out of buffer.c conflict change]
Signed-off-by: Konrad Rzeszutek Wilk

Dan Magenheimer
2012-01-24 05:06:24 +0800

13 Jan, 2012

1 commit

87192a2a4 vfs: cache request_queue in struct block_device

This makes it possible to get from the inode to the request_queue with one
less cache miss. Used in a follow-on optimization.

The lifetime of the pointer is the same as the gendisk.

This assumes that the queue will always stay the same in the gendisk while
it's visible to block_devices. I think that's safe, correct?

Signed-off-by: Andi Kleen
Acked-by: Jeff Moyer
Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2012-01-13 12:13:12 +0800

11 Jan, 2012

1 commit

ace8577ae block_dev: Suppress bdev_cache_init() kmemleak warninig

Kmemleak reports the following warning in bdev_cache_init()
[ 0.003738] kmemleak: Object 0xffff880153035200 (size 256):
[ 0.003823] kmemleak: comm "swapper/0", pid 0, jiffies 4294667299
[ 0.003909] kmemleak: min_count = 1
[ 0.003988] kmemleak: count = 0
[ 0.004066] kmemleak: flags = 0x1
[ 0.004144] kmemleak: checksum = 0
[ 0.004224] kmemleak: backtrace:
[ 0.004303] [] kmemleak_alloc+0x21/0x3e
[ 0.004446] [] kmem_cache_alloc+0xca/0x1dc
[ 0.004592] [] alloc_vfsmnt+0x1f/0x198
[ 0.004736] [] vfs_kern_mount+0x36/0xd2
[ 0.004879] [] kern_mount_data+0x18/0x32
[ 0.005025] [] bdev_cache_init+0x51/0x81
[ 0.005169] [] vfs_caches_init+0x101/0x10d
[ 0.005313] [] start_kernel+0x344/0x383
[ 0.005456] [] x86_64_start_reservations+0xae/0xb2
[ 0.005602] [] x86_64_start_kernel+0x102/0x111
[ 0.005747] [] 0xffffffffffffffff
[ 0.008653] kmemleak: Trying to color unknown object at 0xffff880153035220 as Grey
[ 0.008754] Pid: 0, comm: swapper/0 Not tainted 3.3.0-rc0-dbg-04200-g8180888-dirty #888
[ 0.008856] Call Trace:
[ 0.008934] [] ? find_and_get_object+0x44/0x118
[ 0.009023] [] paint_ptr+0x57/0x8f
[ 0.009109] [] kmemleak_not_leak+0x23/0x42
[ 0.009195] [] bdev_cache_init+0x72/0x81
[ 0.009282] [] vfs_caches_init+0x101/0x10d
[ 0.009368] [] start_kernel+0x344/0x383
[ 0.009466] [] x86_64_start_reservations+0xae/0xb2
[ 0.009555] [] ? early_idt_handlers+0x140/0x140
[ 0.009643] [] x86_64_start_kernel+0x102/0x111

due to attempt to mark pointer to `struct vfsmount' as a gray object, which
is embedded into `struct mount' returned from alloc_vfsmnt().

Make `bd_mnt' static, avoiding need to tell kmemleak to mark it gray, as
suggested by Al Viro.

Signed-off-by: Sergey Senozhatsky
Signed-off-by: Al Viro

Sergey Senozhatsky
2012-01-11 02:08:55 +0800

04 Jan, 2012

3 commits

ff01bb483 fs: move code out of buffer.c

Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.

Signed-off-by: Nick Piggin
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Al Viro
2012-01-04 11:54:07 +0800

6b520e056 vfs: fix the stupidity with i_dentry in inode destructors

Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:52:40 +0800

f47ec3f28 trim fs/internal.h

some stuff in there can actually become static; some belongs to pnode.h
as it's a private interface between namespace.c and pnode.c...

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:52:35 +0800

05 Nov, 2011

1 commit

3d0a8d10c Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-block

* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
virtio-blk: use ida to allocate disk index
hpsa: add small delay when using PCI Power Management to reset for kump
cciss: add small delay when using PCI Power Management to reset for kump
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
drivers/block/loop.c: emit uevent on auto release
drivers/block/cpqarray.c: use pci_dev->revision
loop: always allow userspace partitions and optionally support automatic scanning
...

Fix up trivial header file inclusion conflict in drivers/block/loop.c

Linus Torvalds
2011-11-05 08:22:14 +0800

19 Oct, 2011

1 commit

523e1d399 block: make gendisk hold a reference to its queue

The following command sequence triggers an oops.

# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt

general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:

Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
RIP: 0010:[] [] __lock_acquire+0x389/0x1d60
...
Call Trace:
[] lock_acquire+0x95/0x140
[] _raw_spin_lock+0x3b/0x50
[] bdi_lock_two+0x5c/0x70
[] bdev_inode_switch_bdi+0x4c/0xf0
[] __blkdev_put+0x11b/0x1d0
[] __blkdev_put+0x160/0x1d0
[] blkdev_put+0x5f/0x190
[] kill_block_super+0x4d/0x80
[] deactivate_locked_super+0x45/0x70
[] deactivate_super+0x4a/0x70
[] mntput_no_expire+0xed/0x130
[] sys_umount+0x7e/0x3a0
[] system_call_fastpath+0x16/0x1b

This is because bdev holds on to disk but disk doesn't pin the
associated queue. If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.

Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.

Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.

Signed-off-by: Tejun Heo
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Tejun Heo
2011-10-19 20:31:07 +0800
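
The reference-counting rule behind this fix can be illustrated with a toy model. Everything here is hypothetical (struct names, helper names, and the "freed" flag standing in for actual deallocation); it only demonstrates that once the disk pins its queue, dropping the base reference no longer destroys the queue while the disk is still alive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct queue_model { int refs; bool freed; };
struct disk_model  { struct queue_model *queue; };

void queue_get(struct queue_model *q)
{
    assert(!q->freed);          /* taking a reference on a dead queue is a bug */
    q->refs++;
}

void queue_put(struct queue_model *q)
{
    assert(q->refs > 0);
    if (--q->refs == 0)
        q->freed = true;        /* stands in for freeing the queue */
}

/* Model of add_disk(): the disk takes its own reference to the queue. */
void model_add_disk(struct disk_model *d, struct queue_model *q)
{
    d->queue = q;
    queue_get(q);
}

/* Model of disk_release(): the disk drops that reference again. */
void model_disk_release(struct disk_model *d)
{
    queue_put(d->queue);
    d->queue = NULL;
}
```

Starting from one base reference, calling model_add_disk() and then dropping the base reference (as device removal does) leaves the queue alive; it is only marked freed after model_disk_release().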

10 Sep, 2011

1 commit

94007751b Avoid dereferencing a 'request_queue' after last close.

On the last close of an 'md' device which has been stopped, the device
is destroyed and in particular the request_queue is freed. The free
is done in a separate thread so it might happen a short time later.

__blkdev_put calls bdev_inode_switch_bdi *after* ->release has been
called.

Since commit f758eeabeb96f878c860e8f110f94ec8820822a9
bdev_inode_switch_bdi will dereference the 'old' bdi, which lives
inside a request_queue, to get a spin lock. This causes the last
close on an md device to sometimes take a spin_lock which lives in
freed memory - which results in an oops.

So move the call to bdev_inode_switch_bdi before the call to
->release.

Cc: Christoph Hellwig
Cc: Hugh Dickins
Cc: Andrew Morton
Cc: Wu Fengguang
Acked-by: Wu Fengguang
Cc: stable@kernel.org
Signed-off-by: NeilBrown

NeilBrown
2011-09-10 15:20:21 +0800

24 Aug, 2011

1 commit

d27769ec3 block: add GENHD_FL_NO_PART_SCAN

There are cases where suppressing partition scan is useful - e.g. for
lo devices and pseudo SATA devices which advertise to be a disk but
get upset on partition scan (some port multiplier control devices show
such behavior).

This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
regardless of the number of possible partitions. disk_partitionable()
is renamed to disk_part_scan_enabled() as suppressing partition scan
doesn't imply the device can't be partitioned using
BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
directly tests disk_max_parts() to maintain backward-compatibility.

-v2: Updated to make it clear that only partition scan is suppressed
not partitioning itself as suggested by Kay Sievers.

Signed-off-by: Tejun Heo
Cc: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-08-24 02:01:04 +0800
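
The distinction between "scan suppressed" and "not partitionable" can be sketched as two separate predicates. The struct, the placeholder flag value, and the helper names below are assumptions for illustration and do not reproduce the kernel's definitions; in particular, treating "more than one possible partition" as the partitionable test is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FL_NO_PART_SCAN 0x1u  /* placeholder bit, not the kernel's value */

struct part_disk {
    int max_parts;          /* number of possible partitions */
    unsigned int flags;     /* MODEL_FL_* bits */
};

/* Scanning happens only if partitions are possible and not suppressed. */
bool part_scan_enabled(const struct part_disk *d)
{
    assert(d->max_parts >= 1);
    return d->max_parts > 1 && !(d->flags & MODEL_FL_NO_PART_SCAN);
}

/* Partitioning from userland depends only on the partition count. */
bool partitionable(const struct part_disk *d)
{
    assert(d->max_parts >= 1);
    return d->max_parts > 1;
}
```

A disk with 16 possible partitions and the flag set reports part_scan_enabled() as false while partitionable() stays true, which matches the point made in the v2 note.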

02 Aug, 2011

1 commit

da5aa861b fix block device fallout from ->fsync() changes

blkdev_fsync() needs to write pages in pagecache...

Signed-off-by: Rafael J. Wysocki
Signed-off-by: Al Viro

Rafael J. Wysocki
2011-08-02 09:33:47 +0800

01 Aug, 2011

1 commit

782b94cdf block: initialise bd_super in bdget()

bd_super is currently reset to NULL in kill_block_super() so we rely on previous
users of the block_device object to initialise this value for the next user.
This quirk was exposed on RHEL5 when a third party filesystem did not always use
kill_block_super() and therefore bd_super wasn't being reset when a block_device
object was recycled within the cache. This may not be a problem upstream but
makes sense to be defensive.

Signed-off-by: Lachlan McIlroy
Reviewed-by: Eric Sandeen
Signed-off-by: Al Viro

Lachlan McIlroy
2011-08-01 13:57:44 +0800

27 Jul, 2011

1 commit

f01ef569c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
mm: properly reflect task dirty limits in dirty_exceeded logic
writeback: don't busy retry writeback on new/freeing inodes
writeback: scale IO chunk size up to half device bandwidth
writeback: trace global_dirty_state
writeback: introduce max-pause and pass-good dirty limits
writeback: introduce smoothed global dirty limit
writeback: consolidate variable names in balance_dirty_pages()
writeback: show bdi write bandwidth in debugfs
writeback: bdi write bandwidth estimation
writeback: account per-bdi accumulated written pages
writeback: make writeback_control.nr_to_write straight
writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
writeback: trace event writeback_queue_io
writeback: trace event writeback_single_inode
writeback: remove .nonblocking and .encountered_congestion
writeback: remove writeback_control.more_io
writeback: skip balance_dirty_pages() for in-memory fs
writeback: add bdi_dirty_limit() kernel-doc
writeback: avoid extra sync work at enqueue time
writeback: elevate queue_io() into wb_writeback()
...

Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

Linus Torvalds
2011-07-27 01:39:54 +0800

26 Jul, 2011

1 commit

096a705bb Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block

* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
block: strict rq_affinity
backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
block: fix patch import error in max_discard_sectors check
block: reorder request_queue to remove 64 bit alignment padding
CFQ: add think time check for group
CFQ: add think time check for service tree
CFQ: move think time check variables to a separate struct
fixlet: Remove fs_excl from struct task.
cfq: Remove special treatment for metadata rqs.
block: document blk_plug list access
block: avoid building too big plug list
compat_ioctl: fix make headers_check regression
block: eliminate potential for infinite loop in blkdev_issue_discard
compat_ioctl: fix warning caused by qemu
block: flush MEDIA_CHANGE from drivers on close(2)
blk-throttle: Make total_nr_queued unsigned
block: Add __attribute__((format(printf...) and fix fallout
fs/partitions/check.c: make local symbols static
block:remove some spare spaces in genhd.c
block:fix the comment error in blkdev.h
...

Linus Torvalds
2011-07-26 01:33:36 +0800

21 Jul, 2011

2 commits

02c24a821 fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness' sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara
Signed-off-by: Josef Bacik
Signed-off-by: Al Viro

Josef Bacik
2011-07-21 08:47:59 +0800

06222e491 fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek

This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the proper due diligence is done. For example,
in NFS/CIFS we need to make sure the file size is updated properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Al Viro

Josef Bacik
2011-07-21 08:47:58 +0800
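
For a file that has no holes, the generic behaviour of the two whence values can be modelled as below. The enum values are stand-ins for the real SEEK_DATA and SEEK_HOLE constants, the helper name is hypothetical, and -1 stands in for the -ENXIO error a real implementation would return.

```c
#include <assert.h>

/* Stand-ins for the real whence constants; values are arbitrary here. */
enum { MODEL_SEEK_DATA, MODEL_SEEK_HOLE };

/*
 * Hypothetical helper: resolve SEEK_DATA/SEEK_HOLE for a file of `size`
 * bytes that contains no holes. Returns the new offset, or -1 where a
 * real implementation would fail with ENXIO.
 */
long long seek_no_holes(long long offset, long long size, int whence)
{
    assert(size >= 0);

    if (offset < 0 || offset >= size)
        return -1;              /* no data or hole at or past end of file */
    if (whence == MODEL_SEEK_DATA)
        return offset;          /* everything before EOF is data */
    if (whence == MODEL_SEEK_HOLE)
        return size;            /* the only hole is the implicit one at EOF */
    return -1;                  /* not handled by this model */
}
```

With a 100-byte file, a data seek from offset 10 stays at 10, a hole seek from offset 10 lands on 100, and any seek starting at offset 100 fails.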

01 Jul, 2011

1 commit

85ef06d1d block: flush MEDIA_CHANGE from drivers on close(2)

Currently, only open(2) is defined as the 'clearing' point. It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events. Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.

The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage. Among other
things, GET_EVENT can't detect media removal while TUR can.

This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers. disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.

This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).

Note: Given the current usage, it seems @clearing hint is needlessly
complex. disk_clear_events() can simply clear all events and the hint
can be boolean @flush.

Signed-off-by: Tejun Heo
Cc: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-07-01 22:17:47 +0800
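
The way a flush mask accumulates in the clearing hint can be sketched as follows. The struct, the placeholder event bit, and the helper names are hypothetical; resetting the hint after it has been handed over is an assumption of this model rather than something the commit message states.

```c
#include <assert.h>

#define MODEL_EVENT_MEDIA_CHANGE 0x1u   /* placeholder bit for illustration */

struct events_model {
    unsigned int clearing;  /* hint bits waiting to be passed to the driver */
};

/* Model of a flush request: OR the mask into the pending hint. */
void model_flush_events(struct events_model *ev, unsigned int mask)
{
    ev->clearing |= mask;
}

/* Model of the next check: hand the accumulated hint to the driver.
 * Resetting it afterwards is an assumption made by this sketch. */
unsigned int model_check_events(struct events_model *ev)
{
    unsigned int hint = ev->clearing;
    ev->clearing = 0;
    return hint;
}
```

In this model a close would call model_flush_events(&ev, MODEL_EVENT_MEDIA_CHANGE), and the following check would see that bit exactly once.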

13 Jun, 2011

1 commit

d4c208b86 block: use the passed in @bdev when claiming if partno is zero

6b4517a791 (block: implement bd_claiming and claiming block)
introduced claiming block to support O_EXCL blkdev opens properly.

bd_start_claiming() looks up the part 0 bdev and starts claiming
block. The function assumed that there is only one part 0 bdev and
always used bdget_disk(disk, 0) to look it up; unfortunately, this
isn't true for some drivers (floppy) which use multiple block devices
to denote different operating parameters for the same physical device.
There can be multiple part 0 bdev's for the same device number.

This incorrect assumption caused the wrong bdev to be used during
claiming leading to unbalanced bd_holders as reported in the following
bug.

https://bugzilla.kernel.org/show_bug.cgi?id=28522

This patch updates bd_start_claiming() such that it uses the bdev
specified as argument if its partno is zero.
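
The selection rule can be sketched as follows. This is a simplified
userspace model, not the kernel code: struct bdev_model, its fields and
model_pick_whole() are hypothetical names, and looked_up_whole stands in
for whatever a bdget_disk(disk, 0) style lookup would return.

```c
#include <stddef.h>

/* Hypothetical, stripped-down block device for illustration. */
struct bdev_model {
    int partno;                          /* 0 means "whole device" */
    struct bdev_model *looked_up_whole;  /* result of a part 0 lookup */
};

/*
 * Pick the bdev that claiming should operate on.  If the caller
 * already passed a part 0 bdev, use that exact one instead of looking
 * part 0 up again, because the lookup may return a different part 0
 * bdev for drivers that expose several of them for one device.
 */
static struct bdev_model *model_pick_whole(struct bdev_model *bdev)
{
    if (bdev->partno == 0)
        return bdev;
    return bdev->looked_up_whole;
}
```

With two part 0 bdevs for the same device, each one is now claimed as
itself, which keeps the holder accounting balanced between claim and
release.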

Note that this means that different bdev's can be used for the same
device and O_EXCL check can be effectively bypassed. It has always
been broken that way and floppy is fortunately on its way out. Leave
that breakage alone.

Signed-off-by: Tejun Heo
Reported-by: Alex Villacis Lasso
Tested-by: Alex Villacis Lasso
Cc: stable@kernel.org # >= v2.6.36
Signed-off-by: Jens Axboe

Tejun Heo
2011-06-13 18:45:48 +0800

08 Jun, 2011

1 commit

f758eeabe writeback: split inode_wb_list_lock into bdi_writeback.list_lock

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads. It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.
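
The idea of replacing one global lock with a lock per writeback
structure can be sketched in userspace C11. Everything here is an
assumption made for illustration: struct wb_model, the wb_* helpers and
the atomic_flag spinlock are stand-ins, not the kernel's bdi_writeback
or its spinlock implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device writeback state with its own lock. */
struct wb_model {
    atomic_flag list_lock;  /* protects only this structure's lists */
    int nr_dirty;           /* stand-in for the dirty inode lists */
};

/* Try to take the lock once; returns true on success. */
static bool wb_trylock(struct wb_model *wb)
{
    return !atomic_flag_test_and_set_explicit(&wb->list_lock,
                                              memory_order_acquire);
}

/* Spin until the lock is taken. */
static void wb_lock(struct wb_model *wb)
{
    while (!wb_trylock(wb))
        ;
}

static void wb_unlock(struct wb_model *wb)
{
    atomic_flag_clear_explicit(&wb->list_lock, memory_order_release);
}

/* Dirtying an inode only contends with users of the same device. */
static void wb_mark_dirty(struct wb_model *wb)
{
    wb_lock(wb);
    wb->nr_dirty++;
    wb_unlock(wb);
}
```

Holding the lock of one wb_model does not prevent another from being
locked, which is the property a single global lock cannot offer. It
also shows why a workload confined to one device gains nothing: all of
its users still meet on the same lock.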

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
       inode_wb_list_lock:          42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
------------------
inode_wb_list_lock      2 [] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock     34 [] inode_wb_list_del+0x22/0x49
inode_wb_list_lock  12893 [] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock  10702 [] writeback_single_inode+0x16d/0x20a
------------------
inode_wb_list_lock      2 [] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock     19 [] inode_wb_list_del+0x22/0x49
inode_wb_list_lock   5550 [] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock   8511 [] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
 &(&wb->list_lock)->rlock:          11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
------------------------
&(&wb->list_lock)->rlock     10 [] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock   1493 [] writeback_inodes_wb+0x3d/0x150
&(&wb->list_lock)->rlock   3652 [] writeback_sb_inodes+0x123/0x16f
&(&wb->list_lock)->rlock   1412 [] writeback_single_inode+0x17f/0x223
------------------------
&(&wb->list_lock)->rlock      3 [] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock      6 [] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock   2061 [] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock   2629 [] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Wu Fengguang

Christoph Hellwig
2011-06-08 08:25:21 +0800

01 Jun, 2011

1 commit

4c49ff3fe block: blkdev_get() should access ->bd_disk only after success

d4dc210f69 (block: don't block events on excl write for non-optical
devices) added dereferencing of bdev->bd_disk to test
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
%NULL if open failed which can lead to an oops.

Test the flag after testing open was successful, not before.
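
The ordering issue can be sketched with a small userspace model. The
structs, the model_* functions and the flag value are hypothetical and
only illustrate the pattern of checking the result before dereferencing
a pointer that is set on success.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag value for this model only. */
#define FL_BLOCK_EVENTS_ON_EXCL_WRITE (1u << 0)

struct disk_model { unsigned int flags; };
struct bdev_model { struct disk_model *bd_disk; };

/* Hypothetical open: bd_disk is only set when the open succeeds. */
static int model_open(struct bdev_model *bdev, struct disk_model *disk)
{
    if (disk == NULL)
        return -ENXIO;
    bdev->bd_disk = disk;
    return 0;
}

/*
 * Open the device and decide whether events should be blocked.
 * The flag is tested only after the result is known to be 0, so
 * bd_disk is never dereferenced while it may still be NULL.
 */
static int model_get(struct bdev_model *bdev, struct disk_model *disk,
                     int excl_write, int *events_blocked)
{
    int res = model_open(bdev, disk);

    *events_blocked = 0;
    if (res == 0 && excl_write &&
        (bdev->bd_disk->flags & FL_BLOCK_EVENTS_ON_EXCL_WRITE))
        *events_blocked = 1;
    return res;
}
```

Because && evaluates left to right and stops at the first false
operand, a failed open never reaches the bd_disk dereference.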

Signed-off-by: Tejun Heo
Reported-by: David Miller
Tested-by: David Miller
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Tejun Heo
2011-06-01 14:28:47 +0800

23 May, 2011

1 commit

7e69723fe block: move bd_set_size() above rescan_partitions() in __blkdev_get()

02e352287a4 (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check. As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.

sda: detected capacity change from 0 to 146815737856
sdb: detected capacity change from 0 to 146815737856

This patch restores the original order and removes the warning
messages.
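
Why the call order matters can be sketched with a userspace model. The
struct and the model_* functions are hypothetical; the only behavior
taken from the description above is that the rescan step adjusts the
bdev size itself and reports a capacity change when it has to.

```c
/* Hypothetical bdev: its current size and how many messages it caused. */
struct bdev_model {
    unsigned long long size;
    int messages;
};

/* Model of the explicit size setting: silent. */
static void model_bd_set_size(struct bdev_model *bdev,
                              unsigned long long capacity)
{
    bdev->size = capacity;
}

/*
 * Model of the rescan step: if the bdev size does not match the disk
 * capacity yet, fix it up and count one "capacity change" message.
 */
static void model_rescan(struct bdev_model *bdev,
                         unsigned long long capacity)
{
    if (bdev->size != capacity) {
        bdev->messages++;
        bdev->size = capacity;
    }
}
```

Calling model_rescan() first on a fresh bdev produces one message,
while calling model_bd_set_size() first produces none. The final size
is the same either way.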

stable: Please apply together with 02e352287a4 (block: rescan
partitions on invalidated devices on -ENOMEDIA too).

Signed-off-by: Tejun Heo
Reported-by: Tony Luck
Tested-by: Tony Luck
Cc: stable@kernel.org

Stable note: 2.6.39 only.
Signed-off-by: Jens Axboe

Tejun Heo
2011-05-23 19:26:07 +0800

22 Apr, 2011

2 commits

d4dc210f6 block: don't block events on excl write for non-optical devices

Disk event code automatically blocks events on excl write. This is
primarily to avoid issuing polling commands while burning is in
progress. This behavior doesn't fit other types of devices with
removable media where polling commands don't have adverse side
effects and door locking usually doesn't exist.

This patch introduces a new genhd flag which controls the auto-blocking
behavior and uses it to enable auto-blocking only on optical devices.
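
A per-disk opt-in flag of this kind can be sketched as below. This is a
userspace illustration under stated assumptions: the FL_* names and
values, enum dev_kind and the model_* functions are hypothetical and do
not reproduce the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model only. */
#define FL_REMOVABLE                  (1u << 0)
#define FL_BLOCK_EVENTS_ON_EXCL_WRITE (1u << 1)

/* Hypothetical device kinds with removable media. */
enum dev_kind { DEV_OPTICAL, DEV_CARD_READER };

/* Only optical devices opt in to auto-blocking. */
static unsigned int model_disk_flags(enum dev_kind kind)
{
    unsigned int flags = FL_REMOVABLE;

    if (kind == DEV_OPTICAL)
        flags |= FL_BLOCK_EVENTS_ON_EXCL_WRITE;
    return flags;
}

/* Events are blocked only for an exclusive write open with the flag. */
static bool model_should_block_events(unsigned int flags,
                                      bool excl, bool write)
{
    return excl && write && (flags & FL_BLOCK_EVENTS_ON_EXCL_WRITE);
}
```

Putting the decision in a flag keeps the generic open path free of
device-type checks: each driver states whether polling during an
exclusive write is harmful for its hardware.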

Note for stable: 2.6.38 and later only

Cc: stable@kernel.org
Signed-off-by: Tejun Heo
Reported-by: Kay Sievers
Signed-off-by: Jens Axboe

Tejun Heo
2011-04-22 02:54:46 +0800

1196f8b81 block: rescan partitions on invalidated devices on -ENOMEDIA too

__blkdev_get() doesn't rescan partitions if disk->fops->open() fails,
which leads to ghost partition devices lingering after medium removal
is known to both the kernel and userland. The behavior also creates a
subtle inconsistency where O_NONBLOCK open, which doesn't fail even if
there's no medium, clears the ghost partitions, which is exploited to
work around the problem from userland.

Fix it by updating __blkdev_get() to issue partition rescan after
-ENOMEDIA too.
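
The changed condition can be sketched with a userspace model. All names
here (struct disk_model, the model_* functions, PARTS_ON_MEDIUM) are
hypothetical. The errno constant is spelled ENOMEDIUM in <errno.h>; the
fallback define below is an assumption for platforms that lack it.

```c
#include <errno.h>

#ifndef ENOMEDIUM
#define ENOMEDIUM 123  /* assumed fallback value for non-Linux builds */
#endif

#define PARTS_ON_MEDIUM 2  /* hypothetical partition count */

/* Hypothetical disk state for illustration. */
struct disk_model {
    int media_present;  /* is a medium inserted? */
    int invalidated;    /* partition table needs a rescan */
    int nr_parts;       /* partition devices currently registered */
};

/* Model of the driver open: fails when no medium is present. */
static int model_fops_open(const struct disk_model *disk)
{
    return disk->media_present ? 0 : -ENOMEDIUM;
}

/* Model of the rescan: with no medium, no partitions remain. */
static void model_rescan(struct disk_model *disk)
{
    disk->nr_parts = disk->media_present ? PARTS_ON_MEDIUM : 0;
    disk->invalidated = 0;
}

/* Rescan on success and also when the open failed for lack of media. */
static int model_blkdev_get(struct disk_model *disk)
{
    int ret = model_fops_open(disk);

    if (disk->invalidated && (ret == 0 || ret == -ENOMEDIUM))
        model_rescan(disk);
    return ret;
}
```

In this model an open on an emptied drive still fails, but the stale
partition entries are dropped as a side effect instead of lingering.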

This was reported in the following bz.

https://bugzilla.kernel.org/show_bug.cgi?id=13029

Note for stable: 2.6.38 and later only

Cc: stable@kernel.org
Signed-off-by: Tejun Heo
Reported-by: David Zeuthen
Reported-by: Martin Pitt
Reported-by: Kay Sievers
Tested-by: Kay Sievers
Cc: Alan Cox
Signed-off-by: Jens Axboe

Tejun Heo
2011-04-22 02:54:45 +0800