16 Jan, 2012

1 commit

  • * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
    Revert "block: recursive merge requests"
    block: Stop using macro stubs for the bio data integrity calls
    blockdev: convert some macros to static inlines
    fs: remove unneeded plug in mpage_readpages()
    block: Add BLKROTATIONAL ioctl
    block: Introduce blk_set_stacking_limits function
    block: remove WARN_ON_ONCE() in exit_io_context()
    block: an exiting task should be allowed to create io_context
    block: ioc_cgroup_changed() needs to be exported
    block: recursive merge requests
    block, cfq: fix empty queue crash caused by request merge
    block, cfq: move icq creation and rq->elv.icq association to block core
    block, cfq: restructure io_cq creation path for io_context interface cleanup
    block, cfq: move io_cq exit/release to blk-ioc.c
    block, cfq: move icq cache management to block core
    block, cfq: move io_cq lookup to blk-ioc.c
    block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
    block, cfq: reorganize cfq_io_context into generic and cfq specific parts
    block: remove elevator_queue->ops
    block: reorder elevator switch sequence
    ...

    Fix up conflicts in:
    - block/blk-cgroup.c
    Switch from can_attach_task to can_attach
    - block/cfq-iosched.c
    conflict with now removed cic index changes (we now use q->id instead)

    Linus Torvalds
     

07 Jan, 2012

1 commit


04 Jan, 2012

3 commits


14 Dec, 2011

1 commit

  • * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
    failure instead of 0 / -errno or boolean. Update it such that it
    returns %true on success and %false on failure.

    * Make sure the caller checks for the return value.

    * Separate out __blk_get_queue() which doesn't check whether @q is
    dead and put it in blk.h. This will be used later.

    This patch doesn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Nov, 2011

1 commit

  • This reverts commit a72c5e5eb738033938ab30d6a634b74d1d060f10.

    The commit introduced alias for block devices which is intended to be
    used during logging although actual usage hasn't been committed yet.
    This approach adds very limited benefit (raw log might be easier to
    follow) which can be trivially implemented in userland but has a lot
    of problems.

    It is much worse than netif renames because it doesn't rename the
    actual device but just adds conveninence name which isn't used
    universally or enforced. Everything internal including device lookup
    and sysfs still uses the internal name and nothing prevents two
    devices from using conflicting alias - ie. sda can have sdb as its
    alias.

    This has been nacked by people working on device driver core, block
    layer and kernel-userland interface and shouldn't have been
    upstreamed. Revert it.

    http://thread.gmane.org/gmane.linux.kernel/1155104
    http://thread.gmane.org/gmane.linux.scsi/68632
    http://thread.gmane.org/gmane.linux.scsi/69776

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Acked-by: Kay Sievers
    Cc: "James E.J. Bottomley"
    Cc: Nao Nishijima
    Cc: Alan Cox
    Cc: Al Viro
    Signed-off-by: Jens Axboe

    Tejun Heo
     

05 Nov, 2011

2 commits

  • * 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
    virtio-blk: use ida to allocate disk index
    hpsa: add small delay when using PCI Power Management to reset for kump
    cciss: add small delay when using PCI Power Management to reset for kump
    xen/blkback: Fix two races in the handling of barrier requests.
    xen/blkback: Check for proper operation.
    xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
    xen/blkback: Report VBD_WSECT (wr_sect) properly.
    xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
    xen-blkfront: plug device number leak in xlblk_init() error path
    xen-blkfront: If no barrier or flush is supported, use invalid operation.
    xen-blkback: use kzalloc() in favor of kmalloc()+memset()
    xen-blkback: fixed indentation and comments
    xen-blkfront: fix a deadlock while handling discard response
    xen-blkfront: Handle discard requests.
    xen-blkback: Implement discard requests ('feature-discard')
    xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
    drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
    drivers/block/loop.c: emit uevent on auto release
    drivers/block/cpqarray.c: use pci_dev->revision
    loop: always allow userspace partitions and optionally support automatic scanning
    ...

    Fic up trivial header file includsion conflict in drivers/block/loop.c

    Linus Torvalds
     
  • * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

19 Oct, 2011

1 commit

  • The following command sequence triggers an oops.

    # mount /dev/sdb1 /mnt
    # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
    # umount /mnt

    general protection fault: 0000 [#1] PREEMPT SMP
    CPU 2
    Modules linked in:

    Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
    RIP: 0010:[] [] __lock_acquire+0x389/0x1d60
    ...
    Call Trace:
    [] lock_acquire+0x95/0x140
    [] _raw_spin_lock+0x3b/0x50
    [] bdi_lock_two+0x5c/0x70
    [] bdev_inode_switch_bdi+0x4c/0xf0
    [] __blkdev_put+0x11b/0x1d0
    [] __blkdev_put+0x160/0x1d0
    [] blkdev_put+0x5f/0x190
    [] kill_block_super+0x4d/0x80
    [] deactivate_locked_super+0x45/0x70
    [] deactivate_super+0x4a/0x70
    [] mntput_no_expire+0xed/0x130
    [] sys_umount+0x7e/0x3a0
    [] system_call_fastpath+0x16/0x1b

    This is because bdev holds on to disk but disk doesn't pin the
    associated queue. If a SCSI device is removed while the device is
    still open, the sdev puts the base reference to the queue on release.
    When the bdev is finally released, the associated queue is already
    gone along with the bdi and bdev_inode_switch_bdi() ends up
    dereferencing already freed bdi.

    Even if it were not for this bug, disk not holding onto the associated
    queue is very unusual and error-prone.

    Fix it by making add_disk() take an extra reference to its queue and
    put it on disk_release() and ensuring that disk and its fops owner are
    put in that order after all accesses to the disk and queue are
    complete.

    Signed-off-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

29 Aug, 2011

1 commit

  • This patch allows the user to set an "alias" of the disk via sysfs interface.

    This patch only adds a new attribute "alias" in gendisk structure.
    To show the alias instead of the device name in kernel messages,
    we need to revise printk messages and use alias_name() in them.

    Example:
    (current) printk("disk name is %s\n", disk->disk_name);
    (new) printk("disk name is %s\n", alias_name(disk));

    Users can use alphabets, numbers, '-' and '_' in "alias" attribute. A disk can
    have an "alias" which length is up to 255 bytes. This attribute is write-once.

    Suggested-by: James Bottomley
    Suggested-by: Jon Masters
    Signed-off-by: Nao Nishijima
    Signed-off-by: James Bottomley

    Nao Nishijima
     

24 Aug, 2011

1 commit

  • There are cases where suppressing partition scan is useful - e.g. for
    lo devices and pseudo SATA devices which advertise to be a disk but
    get upset on partition scan (some port multiplier control devices show
    such behavior).

    This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
    regardless of the number of possible partitions. disk_partitionable()
    is renamed to disk_part_scan_enabled() as suppressing partition scan
    doesn't imply the device can't be partitioned using
    BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
    directly tests disk_max_parts() to maintain backward-compatibility.

    -v2: Updated to make it clear that only partition scan is suppressed
    not partitioning itself as suggested by Kay Sievers.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Aug, 2011

1 commit

  • Remove the (unsigned long long) cast in diskstats_show() and adjusts the
    seq_printf() format string to 'unsigned long'

    diskstats_show() uses part_stat_read() to get the stats, which either
    accesses the specified field in the struct disk_stats directly (non SMP)
    or sums up the per CPU values in a variable of the same type as the field,
    so in any case the result will have the same type and range as the
    specified field which for all disk_stats entries is unsigned long

    Also, for unsigned long ranges the output of %lu should be identical to
    the one of %llu, so no change in the actual proc entry contents.

    Signed-off-by: Herbert Poetzl
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Herbert Poetzl
     

26 Jul, 2011

1 commit

  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

21 Jul, 2011

1 commit


01 Jul, 2011

2 commits

  • Currently, only open(2) is defined as the 'clearing' point. It has
    two roles - first, it's an acknowledgement from userland indicating
    that the event has been received and kernel can clear pending states
    and proceed to generate more events. Secondly, it's passed on to
    device drivers as a hint indicating that a synchronization point has
    been reached and it might want to take a deeper look at the device.

    The latter currently is only used by sr which uses two different
    mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
    to discover events, where the former is lighter weight and safe to be
    used repeatedly but may not provide full coverage. Among other
    things, GET_EVENT can't detect media removal while TUR can.

    This patch makes close(2) - blkdev_put() - indicate clearing hint for
    MEDIA_CHANGE to drivers. disk_check_events() is renamed to
    disk_flush_events() and updated to take @mask for events to flush
    which is or'd to ev->clearing and will be passed to the driver on the
    next ->check_events() invocation.

    This change makes sr generate MEDIA_CHANGE when media is ejected from
    userland - e.g. with eject(1).

    Note: Given the current usage, it seems @clearing hint is needlessly
    complex. disk_clear_events() can simply clear all events and the hint
    can be boolean @flush.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Conflicts:
    block/blk-throttle.c
    block/cfq-iosched.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 Jun, 2011

1 commit


10 Jun, 2011

3 commits

  • disk_block_events() should guarantee that the event work is not in
    flight on return and once blocked it shouldn't issue further
    cancellations.

    Because there was no synchronization between the first blocker doing
    cancel_delayed_work_sync() and the following blockers, the following
    blockers could finish before cancellation was complete, which broke
    both guarantees - event work could be in flight and cancellation could
    happen after return.

    This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
    bug#34662.

    https://bugzilla.kernel.org/show_bug.cgi?id=34662

    Fix it by adding an outer mutex which protects both block count
    manipulation and work cancellation.

    -v2: Use outer mutex instead of bit waitqueue per Linus.

    Signed-off-by: Tejun Heo
    Tested-by: Sitsofe Wheeler
    Reported-by: Sitsofe Wheeler
    Reported-by: Borislav Petkov
    Reported-by: Meelis Roos
    Reported-by: Linus Torvalds
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • After the previous update to disk_check_events(), nobody is using
    non-syncing __disk_block_events(). Remove @sync and, as this makes
    __disk_block_events() virtually identical to disk_block_events(),
    remove the underscore prefixed version.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • This patch is part of fix for triggering of WARN_ON_ONCE() in
    disk_clear_events() reported in bug#34662.

    https://bugzilla.kernel.org/show_bug.cgi?id=34662

    disk_clear_events() blocks events, schedules and flushes the event
    work. It expects the work to have started execution on schedule and
    finished on return from flush. WARN_ON_ONCE() triggers if the event
    work hasn't executed as expected. This problem happens because
    __disk_block_events() fails to guarantee that the event work item is
    not in flight on return from the function in race-free manner. The
    problem is two-fold and this patch addresses one of them.

    When __disk_block_events() is called with @sync == %false, it bumps
    event block count, calls cancel_delayed_work() and return. This makes
    it impossible to guarantee that event polling is not in flight on
    return from syncing __disk_block_events() - if the first blocker was
    non-syncing, polling could still be in progress and later syncing ones
    would assume that the first blocker already canceled it.

    Making __disk_block_events() cancel_sync regardless of block count
    isn't feasible either as it may race with forced event checking in
    disk_clear_events().

    As disk_check_events() is the only user of non-syncing
    __disk_block_events(), updating it to directly cancel and schedule
    event work is the easiest way to solve the issue.

    Note that there's another bug in __disk_block_events() and this patch
    doesn't fix the issue completely. Later patch will fix the other bug.

    Signed-off-by: Tejun Heo
    Tested-by: Sitsofe Wheeler
    Reported-by: Sitsofe Wheeler
    Reported-by: Borislav Petkov
    Reported-by: Meelis Roos
    Reported-by: Linus Torvalds
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     

27 May, 2011

1 commit

  • 9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe
    drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block
    drivers which have inadequate ->check_events(). Combined with earlier
    change 7c88a168da (block: don't propagate unlisted DISK_EVENTs to
    userland), this enables using ->check_events() for internal processing
    while avoiding enabling in-kernel block event polling which can lead
    to infinite event loop.

    Unfortunately, this made many drivers including floppy without any bit
    set in disk->events and ->async_events in which case disk_add_events()
    simply skipped allocation of disk->ev, which disables whole event
    handling. As ->check_events() is still used during open processing
    for revalidation, this can lead to open failure.

    This patch always allocates disk->ev if ->check_events is implemented.
    In the long term, it would make sense to simply include the event
    structure inline into genhd as it's now used by virtually all block
    devices.

    Signed-off-by: Tejun Heo
    Reported-by: Ondrej Zary
    Reported-by: Alex Villacis Lasso
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

22 Apr, 2011

1 commit

  • DISK_EVENT_MEDIA_CHANGE is used for both userland visible event and
    internal event for revalidation of removeable devices. Some legacy
    drivers don't implement proper event detection and continuously
    generate events under certain circumstances. For example, ide-cd
    generates media changed continuously if there's no media in the drive,
    which can lead to infinite loop of events jumping back and forth
    between the driver and userland event handler.

    This patch updates disk event infrastructure such that it never
    propagates events not listed in disk->events to userland. Those
    events are processed the same for internal purposes but uevent
    generation is suppressed.

    This also ensures that userland only gets events which are advertised
    in the @events sysfs node lowering risk of confusion.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

31 Mar, 2011

1 commit


10 Mar, 2011

1 commit

  • Currently, disk_unblock_events() implicitly kick event check if the
    block count reaches zero. This behavior is not described in the
    comment and hinders with future changes. Make the unblocker
    explicitly check events by calling disk_check_events() as necessary.

    This patch doesn't cause any behavior difference.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Kay Sievers

    Tejun Heo
     

05 Mar, 2011

1 commit

  • This merge creates two set of conflicts. One is simple context
    conflicts caused by removal of throtl_scheduled_delayed_work() in
    for-linus and removal of throtl_shutdown_timer_wq() in
    for-2.6.39/core.

    The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
    call directly into q->request_fn() __blk_run_queue()) in for-linus
    crashing with FLUSH reimplementation in for-2.6.39/core. The conflict
    isn't trivial but the resolution is straight-forward.

    * __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
    should be called with @force_kblockd set to %true.

    * elv_insert() in blk_kick_flush() should use
    %ELEVATOR_INSERT_REQUEUE.

    Both changes are to avoid invoking ->request_fn() directly from
    request completion path and closely match the changes in the commit
    255bb490c8.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

03 Mar, 2011

1 commit


24 Feb, 2011

1 commit

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change) so any
    data will hold becomes irrelevant.
    In the oter, the device has changed size (check_disk_size_change)
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_devices.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flusk_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rathher than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a which causes
    check_disk_size_change to call flush_disk, so it is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
     

13 Jan, 2011

1 commit


07 Jan, 2011

1 commit


05 Jan, 2011

1 commit

  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
    ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
    merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

    The detailed root cause is as follows.

    Assuming that there are two partition, sda1 and sda2.

    1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
    is 0 and sda2's one is 1.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio belongs to sda1 is issued and is merged into the request mentioned on
    step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
    from sda2 region to sda1 region. However the two partition's
    hd_struct->in_flight are not changed.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request is finished and blk_account_io_done() is called. In this case,
    sda2's hd_struct->in_flight, not a sda1's one, is decremented.

    | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.

    Also add a refcount to struct hd_struct to keep the partition in
    memory as long as users exist. We use kref_test_and_get() to ensure
    we don't add a reference to a partition which is going away.

    Signed-off-by: Jerome Marchand
    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jerome Marchand
     

17 Dec, 2010

5 commits

  • The major/minor device numbers are always defined and used as `unsigned'.

    Signed-off-by: Yang Zhang
    Signed-off-by: Jens Axboe

    Yang Zhang
     
  • Signed-off-by: Yang Zhang
    Signed-off-by: Jens Axboe

    Yang Zhang
     
  • Currently, media presence polling for removeable block devices is done
    from userland. There are several issues with this.

    * Polling is done by periodically opening the device. For SCSI
    devices, the command sequence generated by such action involves a
    few different commands including TEST_UNIT_READY. This behavior,
    while perfectly legal, is different from Windows which only issues
    single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
    ATAPI devices lock up after being periodically queried such command
    sequences.

    * There is no reliable and unintrusive way for a userland program to
    tell whether the target device is safe for media presence polling.
    For example, polling for media presence during an on-going burning
    session can make it fail. The polling program can avoid this by
    opening the device with O_EXCL but then it risks making a valid
    exclusive user of the device fail w/ -EBUSY.

    * Userland polling is unnecessarily heavy and in-kernel implementation
    is lighter and better coordinated (workqueue, timer slack).

    This patch implements framework for in-kernel disk event handling,
    which includes media presence polling.

    * bdops->check_events() is added, which supercedes ->media_changed().
    It should check whether there's any pending event and return if so.
    Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
    DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
    called parallelly.

    * gendisk->events and ->async_events are added. These should be
    initialized by block driver before passing the device to add_disk().
    The former contains the mask of all supported events and the latter
    the mask of all events which the device can report without polling.
    /sys/block/*/events[_async] export these to userland.

    * Kernel parameter block.events_dfl_poll_msecs controls the system
    polling interval (default is 0 which means disable) and
    /sys/block/*/events_poll_msecs control polling intervals for
    individual devices (default is -1 meaning use system setting). Note
    that if a device can report all supported events asynchronously and
    its polling interval isn't explicitly set, the device won't be
    polled regardless of the system polling interval.

    * If a device is opened exclusively with write access, event checking
    is automatically disabled until all write exclusive accesses are
    released.

    * There are event 'clearing' events. For example, both of currently
    defined events are cleared after the device has been successfully
    opened. This information is passed to ->check_events() callback
    using @clearing argument as a hint.

    * Event checking is always performed from system_nrt_wq and timer
    slack is set to 25% for polling.

    * Nothing changes for drivers which implement ->media_changed() but
    not ->check_events(). Going forward, all drivers will be converted
    to ->check_events() and ->media_change() will be dropped.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There's no reason for register_disk() and del_gendisk() to be in
    fs/partitions/check.c. Move both to genhd.c. While at it, collapse
    unlink_gendisk(), which was artificially in a separate function due to
    genhd.c / check.c split, into del_gendisk().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There's no user of the facility. Kill it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Oct, 2010

1 commit


23 Oct, 2010

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (31 commits)
    driver core: Display error codes when class suspend fails
    Driver core: Add section count to memory_block struct
    Driver core: Add mutex for adding/removing memory blocks
    Driver core: Move find_memory_block routine
    hpilo: Despecificate driver from iLO generation
    driver core: Convert link_mem_sections to use find_memory_block_hinted.
    driver core: Introduce find_memory_block_hinted which utilizes kset_find_obj_hinted.
    kobject: Introduce kset_find_obj_hinted.
    driver core: fix build for CONFIG_BLOCK not enabled
    driver-core: base: change to new flag variable
    sysfs: only access bin file vm_ops with the active lock
    sysfs: Fail bin file mmap if vma close is implemented.
    FW_LOADER: fix kconfig dependency warning on HOTPLUG
    uio: Statically allocate uio_class and use class .dev_attrs.
    uio: Support 2^MINOR_BITS minors
    uio: Cleanup irq handling.
    uio: Don't clear driver data
    uio: Fix lack of locking in init_uio_class
    SYSFS: Allow boot time switching between deprecated and modern sysfs layout
    driver core: remove CONFIG_SYSFS_DEPRECATED_V2 but keep it for block devices
    ...

    Linus Torvalds
     
  • I have some systems which need legacy sysfs due to old tools that are
    making assumptions that a directory can never be a symlink to another
    directory, and it's a big hazzle to compile separate kernels for them.

    This patch turns CONFIG_SYSFS_DEPRECATED into a run time option
    that can be switched on/off the kernel command line. This way
    the same binary can be used in both cases with just a option
    on the command line.

    The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
    the default. I kept the weird name to not break existing
    config files.

    Also the compat code can be still completely disabled by undefining
    CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes
    care of this now instead of lots of ifdefs. This makes the code
    look nicer.

    v2: This is an updated version on top of Kay's patch to only
    handle the block devices. I tested it on my old systems
    and that seems to work.

    Cc: axboe@kernel.dk
    Signed-off-by: Andi Kleen
    Cc: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

19 Oct, 2010

1 commit

  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
    ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
    merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

    The detailed root cause is as follows.

    Assuming that there are two partition, sda1 and sda2.

    1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
    is 0 and sda2's one is 1.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio belongs to sda1 is issued and is merged into the request mentioned on
    step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
    from sda2 region to sda1 region. However the two partition's
    hd_struct->in_flight are not changed.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request is finished and blk_account_io_done() is called. In this case,
    sda2's hd_struct->in_flight, not a sda1's one, is decremented.

    | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.

    When reloading partition tables, quiesce IO to ensure that no
    request references to the partition struct exists. When it is safe
    to free the partition table, the IO for that device is restarted
    again.

    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu