13 Jan, 2012

1 commit

  • This makes it possible to get from the inode to the request_queue with one
    less cache miss. Used in a follow-on optimization.

    The lifetime of the pointer is the same as that of the gendisk.

    This assumes that the queue will always stay the same in the gendisk while
    it's visible to block_devices. I think that's safe, correct?

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
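
The pointer caching described in this commit can be sketched in userspace C. This is a minimal model, not the kernel code: the structures are stripped-down stand-ins, the field name `bd_queue` and the helper `bdev_attach_disk()` are assumptions made for illustration, and the assertion encodes the commit's stated assumption that the disk keeps the same queue.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (illustrative only). */
struct request_queue { int id; };
struct gendisk { struct request_queue *queue; };
struct block_device {
	struct gendisk *bd_disk;
	struct request_queue *bd_queue; /* assumed name: cached bd_disk->queue */
};

/* Hypothetical helper: attach a disk and cache its queue pointer. */
static void bdev_attach_disk(struct block_device *bdev, struct gendisk *disk)
{
	bdev->bd_disk = disk;
	bdev->bd_queue = disk->queue;
}

/* Two dependent loads: bdev -> disk -> queue. */
static struct request_queue *queue_via_disk(const struct block_device *bdev)
{
	return bdev->bd_disk->queue;
}

/* One load: bdev -> queue. Valid only while the disk keeps the same queue. */
static struct request_queue *queue_cached(const struct block_device *bdev)
{
	assert(bdev->bd_queue == bdev->bd_disk->queue);
	return bdev->bd_queue;
}
```

The cached path removes one dependent pointer dereference, which is where the saved cache miss would come from.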
     

11 Jan, 2012

1 commit

  • Kmemleak reports the following warning in bdev_cache_init()
    [ 0.003738] kmemleak: Object 0xffff880153035200 (size 256):
    [ 0.003823] kmemleak: comm "swapper/0", pid 0, jiffies 4294667299
    [ 0.003909] kmemleak: min_count = 1
    [ 0.003988] kmemleak: count = 0
    [ 0.004066] kmemleak: flags = 0x1
    [ 0.004144] kmemleak: checksum = 0
    [ 0.004224] kmemleak: backtrace:
    [ 0.004303] [] kmemleak_alloc+0x21/0x3e
    [ 0.004446] [] kmem_cache_alloc+0xca/0x1dc
    [ 0.004592] [] alloc_vfsmnt+0x1f/0x198
    [ 0.004736] [] vfs_kern_mount+0x36/0xd2
    [ 0.004879] [] kern_mount_data+0x18/0x32
    [ 0.005025] [] bdev_cache_init+0x51/0x81
    [ 0.005169] [] vfs_caches_init+0x101/0x10d
    [ 0.005313] [] start_kernel+0x344/0x383
    [ 0.005456] [] x86_64_start_reservations+0xae/0xb2
    [ 0.005602] [] x86_64_start_kernel+0x102/0x111
    [ 0.005747] [] 0xffffffffffffffff
    [ 0.008653] kmemleak: Trying to color unknown object at 0xffff880153035220 as Grey
    [ 0.008754] Pid: 0, comm: swapper/0 Not tainted 3.3.0-rc0-dbg-04200-g8180888-dirty #888
    [ 0.008856] Call Trace:
    [ 0.008934] [] ? find_and_get_object+0x44/0x118
    [ 0.009023] [] paint_ptr+0x57/0x8f
    [ 0.009109] [] kmemleak_not_leak+0x23/0x42
    [ 0.009195] [] bdev_cache_init+0x72/0x81
    [ 0.009282] [] vfs_caches_init+0x101/0x10d
    [ 0.009368] [] start_kernel+0x344/0x383
    [ 0.009466] [] x86_64_start_reservations+0xae/0xb2
    [ 0.009555] [] ? early_idt_handlers+0x140/0x140
    [ 0.009643] [] x86_64_start_kernel+0x102/0x111

    due to an attempt to mark the pointer to `struct vfsmount', which is
    embedded in the `struct mount' returned from alloc_vfsmnt(), as a gray object.

    Make `bd_mnt' static, avoiding the need to tell kmemleak to mark it gray, as
    suggested by Al Viro.

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Al Viro

    Sergey Senozhatsky
     

04 Jan, 2012

3 commits

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     
  • Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
    it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
    the cost of taking it into inode_init_always() will be negligible for pipes
    and sockets and negative for everything else. Not to mention the removal of
    boilerplate code from ->destroy_inode() instances...

    Signed-off-by: Al Viro

    Al Viro
     
  • some stuff in there can actually become static; some belongs to pnode.h
    as it's a private interface between namespace.c and pnode.c...

    Signed-off-by: Al Viro

    Al Viro
     

05 Nov, 2011

1 commit

  • * 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
    virtio-blk: use ida to allocate disk index
    hpsa: add small delay when using PCI Power Management to reset for kump
    cciss: add small delay when using PCI Power Management to reset for kump
    xen/blkback: Fix two races in the handling of barrier requests.
    xen/blkback: Check for proper operation.
    xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
    xen/blkback: Report VBD_WSECT (wr_sect) properly.
    xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
    xen-blkfront: plug device number leak in xlblk_init() error path
    xen-blkfront: If no barrier or flush is supported, use invalid operation.
    xen-blkback: use kzalloc() in favor of kmalloc()+memset()
    xen-blkback: fixed indentation and comments
    xen-blkfront: fix a deadlock while handling discard response
    xen-blkfront: Handle discard requests.
    xen-blkback: Implement discard requests ('feature-discard')
    xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
    drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
    drivers/block/loop.c: emit uevent on auto release
    drivers/block/cpqarray.c: use pci_dev->revision
    loop: always allow userspace partitions and optionally support automatic scanning
    ...

    Fix up trivial header file inclusion conflict in drivers/block/loop.c

    Linus Torvalds
     

19 Oct, 2011

1 commit

  • The following command sequence triggers an oops.

    # mount /dev/sdb1 /mnt
    # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
    # umount /mnt

    general protection fault: 0000 [#1] PREEMPT SMP
    CPU 2
    Modules linked in:

    Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
    RIP: 0010:[] [] __lock_acquire+0x389/0x1d60
    ...
    Call Trace:
    [] lock_acquire+0x95/0x140
    [] _raw_spin_lock+0x3b/0x50
    [] bdi_lock_two+0x5c/0x70
    [] bdev_inode_switch_bdi+0x4c/0xf0
    [] __blkdev_put+0x11b/0x1d0
    [] __blkdev_put+0x160/0x1d0
    [] blkdev_put+0x5f/0x190
    [] kill_block_super+0x4d/0x80
    [] deactivate_locked_super+0x45/0x70
    [] deactivate_super+0x4a/0x70
    [] mntput_no_expire+0xed/0x130
    [] sys_umount+0x7e/0x3a0
    [] system_call_fastpath+0x16/0x1b

    This is because bdev holds on to disk but disk doesn't pin the
    associated queue. If a SCSI device is removed while the device is
    still open, the sdev puts the base reference to the queue on release.
    When the bdev is finally released, the associated queue is already
    gone along with the bdi and bdev_inode_switch_bdi() ends up
    dereferencing already freed bdi.

    Even if it were not for this bug, disk not holding onto the associated
    queue is very unusual and error-prone.

    Fix it by making add_disk() take an extra reference to its queue and
    put it on disk_release() and ensuring that disk and its fops owner are
    put in that order after all accesses to the disk and queue are
    complete.

    Signed-off-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
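
The reference-counting fix described above can be modelled in a few lines of userspace C. This is a sketch under assumptions: `struct queue`, `struct disk`, and the `model_*` helpers are hypothetical stand-ins, and "freeing" is represented by a flag rather than real deallocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct queue { int refs; bool freed; };
struct disk { struct queue *queue; };

static void queue_get(struct queue *q)
{
	assert(!q->freed);
	q->refs++;
}

static void queue_put(struct queue *q)
{
	assert(q->refs > 0);
	if (--q->refs == 0)
		q->freed = true; /* stand-in for actually freeing the queue */
}

/* Model of add_disk(): the disk takes its own reference on the queue. */
static void model_add_disk(struct disk *d, struct queue *q)
{
	queue_get(q);
	d->queue = q;
}

/* Model of disk_release(): the disk's reference is dropped last. */
static void model_disk_release(struct disk *d)
{
	queue_put(d->queue);
	d->queue = NULL;
}
```

With the extra reference, dropping the driver's base reference while the disk is still held no longer frees the queue; it goes away only when the disk itself is released.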
     

10 Sep, 2011

1 commit

  • On the last close of an 'md' device which has been stopped, the device
    is destroyed and in particular the request_queue is freed. The free
    is done in a separate thread so it might happen a short time later.

    __blkdev_put calls bdev_inode_switch_bdi *after* ->release has been
    called.

    Since commit f758eeabeb96f878c860e8f110f94ec8820822a9
    bdev_inode_switch_bdi will dereference the 'old' bdi, which lives
    inside a request_queue, to get a spin lock. This causes the last
    close on an md device to sometimes take a spin_lock which lives in
    freed memory - which results in an oops.

    So move the call to bdev_inode_switch_bdi before the call to
    ->release.

    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Cc: Wu Fengguang
    Acked-by: Wu Fengguang
    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

24 Aug, 2011

1 commit

  • There are cases where suppressing partition scan is useful - e.g. for
    lo devices and pseudo SATA devices which advertise to be a disk but
    get upset on partition scan (some port multiplier control devices show
    such behavior).

    This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
    regardless of the number of possible partitions. disk_partitionable()
    is renamed to disk_part_scan_enabled() as suppressing partition scan
    doesn't imply the device can't be partitioned using
    BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
    directly tests disk_max_parts() to maintain backward-compatibility.

    -v2: Updated to make it clear that only partition scan is suppressed
    not partitioning itself as suggested by Kay Sievers.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
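
The distinction between "partition scan suppressed" and "cannot be partitioned" can be illustrated with a small userspace model. The flag value, the `minors` field, and the `model_*` names are assumptions for illustration; only the names disk_max_parts() and disk_part_scan_enabled() come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value, not the kernel's actual bit assignment. */
#define MODEL_FL_NO_PART_SCAN 0x1u

struct toy_disk {
	int minors;          /* assumed: number of partitions the disk allows */
	unsigned int flags;
};

/* Model of disk_max_parts(): how many partitions the disk can have. */
static int model_disk_max_parts(const struct toy_disk *d)
{
	assert(d->minors >= 1);
	return d->minors;
}

/* Model of disk_part_scan_enabled(): scanning requires a partitionable
 * disk AND the absence of the no-scan flag. */
static bool model_part_scan_enabled(const struct toy_disk *d)
{
	return model_disk_max_parts(d) > 1 &&
	       !(d->flags & MODEL_FL_NO_PART_SCAN);
}
```

A disk with the flag set still reports more than one possible partition, which is why a sysfs-style check based on the maximum partition count keeps its old behavior.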
     

02 Aug, 2011

1 commit


01 Aug, 2011

1 commit

  • bd_super is currently reset to NULL in kill_block_super() so we rely on previous
    users of the block_device object to initialise this value for the next user.
    This quirk was exposed on RHEL5 when a third party filesystem did not always use
    kill_block_super() and therefore bd_super wasn't being reset when a block_device
    object was recycled within the cache. This may not be a problem upstream but
    it makes sense to be defensive.

    Signed-off-by: Lachlan McIlroy
    Reviewed-by: Eric Sandeen
    Signed-off-by: Al Viro

    Lachlan McIlroy
     

27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

1 commit

  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

21 Jul, 2011

2 commits

  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push down taking i_mutex and
    the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems can drop taking the i_mutex altogether it seems, like ext3 and
    ocfs2. For correctness' sake I just pushed everything down in all cases to make
    sure that we keep the current behavior the same for everybody, and then each
    individual fs maintainer can make up their mind about what to do from there.
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
    we just return -EINVAL, in others we do the normal generic thing, and in others
    we're simply making sure that the proper due diligence is done. For example
    in NFS/CIFS we need to make sure the file size is updated properly for the
    SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
    that is all we have to do. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

01 Jul, 2011

1 commit

  • Currently, only open(2) is defined as the 'clearing' point. It has
    two roles - first, it's an acknowledgement from userland indicating
    that the event has been received and kernel can clear pending states
    and proceed to generate more events. Secondly, it's passed on to
    device drivers as a hint indicating that a synchronization point has
    been reached and it might want to take a deeper look at the device.

    The latter currently is only used by sr which uses two different
    mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
    to discover events, where the former is lighter weight and safe to be
    used repeatedly but may not provide full coverage. Among other
    things, GET_EVENT can't detect media removal while TUR can.

    This patch makes close(2) - blkdev_put() - indicate clearing hint for
    MEDIA_CHANGE to drivers. disk_check_events() is renamed to
    disk_flush_events() and updated to take @mask for events to flush
    which is or'd to ev->clearing and will be passed to the driver on the
    next ->check_events() invocation.

    This change makes sr generate MEDIA_CHANGE when media is ejected from
    userland - e.g. with eject(1).

    Note: Given the current usage, it seems @clearing hint is needlessly
    complex. disk_clear_events() can simply clear all events and the hint
    can be boolean @flush.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
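
How a flush mask accumulates into a clearing hint and is then handed to the driver can be sketched as follows. This is a simplified model, assuming hypothetical event bit values and a `struct toy_events` that holds only the `clearing` field mentioned in the commit; the real event code does considerably more.

```c
#include <assert.h>

/* Illustrative event bits, not the kernel's definitions. */
#define EV_MEDIA_CHANGE  0x1u
#define EV_EJECT_REQUEST 0x2u

struct toy_events { unsigned int clearing; };

/* Model of disk_flush_events(): OR the requested mask into the pending
 * clearing hint. */
static void model_flush_events(struct toy_events *ev, unsigned int mask)
{
	ev->clearing |= mask;
}

/* Model of the next ->check_events() invocation: the driver receives the
 * accumulated hint, and the pending hint is reset. */
static unsigned int model_check_events(struct toy_events *ev)
{
	unsigned int hint = ev->clearing;

	ev->clearing = 0;
	assert(ev->clearing == 0);
	return hint;
}
```

In this model, a close(2) would call `model_flush_events(&ev, EV_MEDIA_CHANGE)`, and the following check would see the hint exactly once.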
     

13 Jun, 2011

1 commit

  • 6b4517a791 (block: implement bd_claiming and claiming block)
    introduced claiming block to support O_EXCL blkdev opens properly.

    bd_start_claiming() looks up the part 0 bdev and starts claiming
    block. The function assumed that there is only one part 0 bdev and
    always used bdget_disk(disk, 0) to look it up; unfortunately, this
    isn't true for some drivers (floppy) which use multiple block devices
    to denote different operating parameters for the same physical device.
    There can be multiple part 0 bdev's for the same device number.

    This incorrect assumption caused the wrong bdev to be used during
    claiming leading to unbalanced bd_holders as reported in the following
    bug.

    https://bugzilla.kernel.org/show_bug.cgi?id=28522

    This patch updates bd_start_claiming() such that it uses the bdev
    specified as argument if its partno is zero.

    Note that this means that different bdev's can be used for the same
    device and O_EXCL check can be effectively bypassed. It has always
    been broken that way and floppy is fortunately on its way out. Leave
    that breakage alone.

    Signed-off-by: Tejun Heo
    Reported-by: Alex Villacis Lasso
    Tested-by: Alex Villacis Lasso
    Cc: stable@kernel.org # >= v2.6.36
    Signed-off-by: Jens Axboe

    Tejun Heo
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
    akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
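
Taking two per-writeback list locks without deadlocking or self-recursing can be sketched with a toy lock. Everything here is an assumption for illustration: the toy lock only tracks a `held` flag, ordering by address is one common way to get a consistent lock order, and handling the "same object" case inside `model_lock_two()` is a choice of this sketch (the actual fix may live in the caller).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_lock { bool held; };
struct toy_wb { struct toy_lock list_lock; };

static void toy_acquire(struct toy_lock *l)
{
	assert(!l->held); /* re-acquiring would be a recursive lock */
	l->held = true;
}

static void toy_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Lock both writeback structures in a fixed (address) order; if both
 * arguments are the same object, take the lock only once. */
static void model_lock_two(struct toy_wb *a, struct toy_wb *b)
{
	if (a == b) {
		toy_acquire(&a->list_lock);
		return;
	}
	if ((uintptr_t)a < (uintptr_t)b) {
		toy_acquire(&a->list_lock);
		toy_acquire(&b->list_lock);
	} else {
		toy_acquire(&b->list_lock);
		toy_acquire(&a->list_lock);
	}
}

static void model_unlock_two(struct toy_wb *a, struct toy_wb *b)
{
	toy_release(&a->list_lock);
	if (a != b)
		toy_release(&b->list_lock);
}
```

Because the order depends only on the addresses, two callers passing the same pair in opposite argument order still acquire the locks in the same sequence.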
     

01 Jun, 2011

1 commit

  • d4dc210f69 (block: don't block events on excl write for non-optical
    devices) added dereferencing of bdev->bd_disk to test
    GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
    %NULL if open failed which can lead to an oops.

    Test the flag after testing open was successful, not before.

    Signed-off-by: Tejun Heo
    Reported-by: David Miller
    Tested-by: David Miller
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
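
The ordering fix (check that the open succeeded before touching the disk pointer) can be shown with a small model. The structures, the flag value, and `model_should_block_events()` are hypothetical; the point is only that the flag test is reached solely on the success path, where the disk pointer is known to be set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value, not the kernel's actual bit assignment. */
#define MODEL_FL_BLOCK_EVENTS_ON_EXCL_WRITE 0x1u

struct toy_disk { unsigned int flags; };
struct toy_bdev { struct toy_disk *bd_disk; }; /* NULL if open failed */

/* open_result: 0 on success, negative on failure. */
static bool model_should_block_events(const struct toy_bdev *bdev,
				      int open_result, bool excl_write)
{
	if (open_result != 0)
		return false; /* bd_disk may be NULL here: do not touch it */
	if (!excl_write)
		return false;
	assert(bdev->bd_disk != NULL);
	return (bdev->bd_disk->flags & MODEL_FL_BLOCK_EVENTS_ON_EXCL_WRITE) != 0;
}
```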
     

23 May, 2011

1 commit

  • 02e352287a4 (block: rescan partitions on invalidated devices on
    -ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
    to simplify condition check. As rescan_partitions() does its own bdev
    size setting, this doesn't break anything; however,
    rescan_partitions() prints out the following messages when adjusting
    bdev size, which can be confusing.

    sda: detected capacity change from 0 to 146815737856
    sdb: detected capacity change from 0 to 146815737856

    This patch restores the original order and removes the warning
    messages.

    stable: Please apply together with 02e352287a4 (block: rescan
    partitions on invalidated devices on -ENOMEDIA too).

    Signed-off-by: Tejun Heo
    Reported-by: Tony Luck
    Tested-by: Tony Luck
    Cc: stable@kernel.org

    Stable note: 2.6.39 only.
    Signed-off-by: Jens Axboe

    Tejun Heo
     

22 Apr, 2011

2 commits

  • Disk event code automatically blocks events on excl write. This is
    primarily to avoid issuing polling commands while burning is in
    progress. This behavior doesn't fit other types of devices with
    removable media where polling commands don't have adverse side
    effects and door locking usually doesn't exist.

    This patch introduces a new genhd flag which controls the auto-blocking
    behavior and uses it to enable auto-blocking only on optical devices.

    Note for stable: 2.6.38 and later only

    Cc: stable@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • __blkdev_get() doesn't rescan partitions if disk->fops->open() fails,
    which leads to ghost partition devices lingering after medium removal
    is known to both the kernel and userland. The behavior also creates a
    subtle inconsistency where O_NONBLOCK open, which doesn't fail even if
    there's no medium, clears the ghost partitions, which is exploited to
    work around the problem from userland.

    Fix it by updating __blkdev_get() to issue partition rescan after
    -ENOMEDIA too.

    This was reported in the following bz.

    https://bugzilla.kernel.org/show_bug.cgi?id=13029

    Note for stable: 2.6.38 and later only

    Cc: stable@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: David Zeuthen
    Reported-by: Martin Pitt
    Reported-by: Kay Sievers
    Tested-by: Kay Sievers
    Cc: Alan Cox
    Signed-off-by: Jens Axboe

    Tejun Heo
     

31 Mar, 2011

1 commit


25 Mar, 2011

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

19 Mar, 2011

1 commit


10 Mar, 2011

5 commits

  • Conflicts:
    block/blk-core.c
    block/blk-flush.c
    drivers/md/raid1.c
    drivers/md/raid10.c
    drivers/md/raid5.c
    fs/nilfs2/btnode.c
    fs/nilfs2/mdt.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Not all block drivers clear events immediately after reporting. Some
    do so in ->revalidate_disk() or other steps during ->open(). There is
    a slim chance event poll may happen between the clearing event check
    from check_disk_change() and the actual clearing of the events which
    would result in spurious events.

    Block event checks while block device open is in progress. There is
    no need to kick explicit event check afterwards as events are always
    checked during open.

    -v2: The original patch could have called disk_unblock_events() with
    an already released or %NULL @disk causing oops. Fixed by making
    sure references are put after disk_unblock_events() is called.
    It also makes the error path of __blkdev_get() a bit simpler.
    This problem was reported by Jens.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Kay Sievers

    Tejun Heo
     
  • The block event mechanism currently always checks events when the
    device is being closed regardless of the open mode. The intention was
    to allow detection of EJECT_REQUEST when a device is closed whether
    disk event polling is enabled or not.

    This is unnecessary as, for devices of interest, events are checked
    from either userland or kernel and in the former case ->check_events()
    is performed on open of each poll attempt anyway. Furthermore, this
    unconditional event check on close makes the code susceptible to event
    loop if the block driver doesn't clear reported events correctly - an
    event triggers userland to open and close the device which in turn
    causes another event, rinse and repeat.

    Check events on close only if it was blocked by excl write open.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Kay Sievers

    Tejun Heo
     
  • Currently, disk_unblock_events() implicitly kicks an event check if the
    block count reaches zero. This behavior is not described in the
    comment and hinders future changes. Make the unblocker
    explicitly check events by calling disk_check_events() as necessary.

    This patch doesn't cause any behavior difference.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Kay Sievers

    Tejun Heo
     

01 Mar, 2011

1 commit


26 Feb, 2011

1 commit

  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix - again - partition detection when array becomes active
    Fix over-zealous flush_disk when changing device size.
    md: avoid spinlock problem in blk_throtl_exit
    md: correctly handle probe of an 'mdp' device.
    md: don't set_capacity before array is active.
    md: Fix raid1->raid0 takeover

    Linus Torvalds
     

25 Feb, 2011

1 commit

  • The new implementation of bd_link_disk_holder() added by 49731baa41d
    (block: restore multiple bd_link_disk_holder() support) didn't get an
    extra reference for the holder_dir kobject of the slave bdev; however,
    bdev kills holder_dir on removal, not release, so if the slave bdev is
    removed while there are holder links, the holder_dir will be destroyed
    while there still are holder links, which leads to oops later when
    bd_unlink_disk_holder() tries to remove those links.

    Make bd_link_disk_holder() grab an extra reference for the slave's
    holder_dir and put it in bd_unlink_disk_holder().

    Signed-off-by: Tejun Heo
    Reported-by: "Hawrylewicz Czarnowski, Przemyslaw"
    Tested-by: "Hawrylewicz Czarnowski, Przemyslaw"
    Cc: Neil Brown
    Cc: Jens Axboe
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

24 Feb, 2011

1 commit

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change) so any
    data we hold becomes irrelevant.
    In the other, the device has changed size (check_disk_size_change)
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_device.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bdev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flush_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rather than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a which causes
    check_disk_size_change to call flush_disk, so it is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
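
The effect of the new flag can be sketched with a toy inode cache. This is a userspace model under assumptions: `struct toy_inode` and the `model_*` functions are hypothetical, and "discarding" is represented by clearing a `cached` flag. The two wrappers mirror the commit's description that check_disk_change passes true and check_disk_size_change passes false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode { bool dirty; bool cached; };

/* Model of invalidate_inodes() with the new flag: clean inodes are always
 * dropped, dirty ones only when kill_dirty is set. Returns the number of
 * inodes dropped. */
static size_t model_invalidate_inodes(struct toy_inode *inodes, size_t n,
				      bool kill_dirty)
{
	size_t dropped = 0;

	for (size_t i = 0; i < n; i++) {
		if (!inodes[i].cached)
			continue;
		if (inodes[i].dirty && !kill_dirty)
			continue; /* keep dirty data: it can still be written */
		inodes[i].cached = false;
		dropped++;
	}
	assert(dropped <= n);
	return dropped;
}

/* Device disappeared: dirty data has nowhere to go, so discard it too. */
static size_t model_check_disk_change(struct toy_inode *inodes, size_t n)
{
	return model_invalidate_inodes(inodes, n, true);
}

/* Device only changed size: dirty data must survive. */
static size_t model_check_disk_size_change(struct toy_inode *inodes, size_t n)
{
	return model_invalidate_inodes(inodes, n, false);
}
```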
     

17 Feb, 2011

1 commit

  • This reverts commit 75f1dc0d076d ("block: check bdev_read_only() from
    blkdev_get()"). That commit added stricter checking to make sure
    devices that were being used read-only were actually opened in that
    mode.

    It turns out that the change breaks a bunch of kernel code that opens
    block devices. Affected systems include dm, md, and the loop device.
    Because strict checking for read-only opens of block devices was not
    done before this, the code that opens the devices was opening them
    read-write even if they were being used read-only. Auditing all that
    code will take time, and new userspace packages for dm, mdadm, etc.
    will also be required.

    Signed-off-by: Chuck Ebbert
    Signed-off-by: Linus Torvalds

    Chuck Ebbert
     

15 Jan, 2011

1 commit

  • Commit e09b457b (block: simplify holder symlink handling) incorrectly
    assumed that there is at most one link. dm may use multiple
    links and expects block layer to track reference count for each link,
    which is different from and unrelated to the exclusive device holder
    identified by @holder when the device is opened.

    Remove the single holder assumption and automatic removal of the link
    and revive the per-link reference count tracking. The code
    essentially behaves the same as before commit e09b457b sans the
    unnecessary kobject reference count dancing.

    While at it, note that this facility should not be used by anyone other
    than the current users. Sysfs symlinks shouldn't be abused like this
    and the whole thing doesn't belong in the block layer at all.

    Signed-off-by: Tejun Heo
    Reported-by: Milan Broz
    Cc: Jun'ichi Nomura
    Cc: Neil Brown
    Cc: linux-raid@vger.kernel.org
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
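
Per-link reference counting for holder links can be sketched with a fixed-size table. This is an illustrative model only: the table size, `struct holder_link`, and the `model_*` functions are assumptions, and a link here is just a pointer identifying the holder disk rather than a sysfs symlink.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HOLDERS 4

struct holder_link { const void *holder_disk; int refcnt; };
struct toy_bdev { struct holder_link links[MAX_HOLDERS]; };

/* Find the link for a given holder disk; passing NULL finds a free slot. */
static struct holder_link *find_link(struct toy_bdev *b, const void *disk)
{
	for (size_t i = 0; i < MAX_HOLDERS; i++)
		if (b->links[i].holder_disk == disk)
			return &b->links[i];
	return NULL;
}

/* Add a link, or bump its count if it already exists.
 * Returns 0 on success, -1 if the table is full. */
static int model_link_holder(struct toy_bdev *b, const void *disk)
{
	struct holder_link *l;

	assert(disk != NULL);
	l = find_link(b, disk);
	if (l) {
		l->refcnt++;
		return 0;
	}
	l = find_link(b, NULL);
	if (!l)
		return -1;
	l->holder_disk = disk;
	l->refcnt = 1;
	return 0;
}

/* Drop one reference; the link disappears only when the count hits zero. */
static void model_unlink_holder(struct toy_bdev *b, const void *disk)
{
	struct holder_link *l = find_link(b, disk);

	if (l && --l->refcnt == 0)
		l->holder_disk = NULL;
}

static int model_count_links(struct toy_bdev *b)
{
	int n = 0;

	for (size_t i = 0; i < MAX_HOLDERS; i++)
		if (b->links[i].holder_disk)
			n++;
	return n;
}
```

Linking the same holder twice and unlinking it once leaves the link in place, which is the behavior a single-holder assumption would break.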
     

14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

13 Jan, 2011

1 commit