16 Jan, 2015

1 commit

  • commit 5fabcb4c33fe11c7e3afdf805fde26c1a54d0953 upstream.

    We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
    with a user passed in partno value. If we pass in 0x7fffffff, the
    new target in disk_expand_part_tbl() overflows the 'int' and we
    access beyond the end of ptbl->part[] and even write to it when we
    do the rcu_assign_pointer() to assign the new partition.

    Reported-by: David Ramos
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

08 Oct, 2014

1 commit

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     

23 Sep, 2014

1 commit

  • Commit 2da78092 changed the locking from a mutex to a spinlock,
    so we now longer sleep in this context. But there was a leftover
    might_sleep() in there, which now triggers since we do the final
    free from an RCU callback. Get rid of it.

    Reported-by: Pontus Fuchs
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Sep, 2014

1 commit


04 Sep, 2014

1 commit

  • Releases the dev_t minor when all references are closed to prevent
    another device from acquiring the same major/minor.

    Since the partition's release may be invoked from call_rcu's soft-irq
    context, the ext_dev_idr's mutex had to be replaced with a spinlock so
    as not so sleep.

    Signed-off-by: Keith Busch
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
     

12 Sep, 2013

1 commit


04 Jul, 2013

1 commit

  • Disk names may contain arbitrary strings, so they must not be
    interpreted as format strings. It seems that only md allows arbitrary
    strings to be used for disk names, but this could allow for a local
    memory corruption from uid 0 into ring 0.

    CVE-2013-2851

    Signed-off-by: Kees Cook
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

15 May, 2013

1 commit

  • Block layer uses workqueues for multiple purposes. There is no real dependency
    of scheduling these on the cpu which scheduled them.

    On a idle system, it is observed that and idle cpu wakes up many times just to
    service this work. It would be better if we can schedule it on a cpu which the
    scheduler believes to be the most appropriate one.

    This patch replaces normal workqueues with power efficient versions.

    Cc: Jens Axboe
    Signed-off-by: Viresh Kumar
    Signed-off-by: Tejun Heo

    Viresh Kumar
     

12 Apr, 2013

1 commit


08 Apr, 2013

1 commit

  • Some drivers want to tell userspace what uid and gid should be used for
    their device nodes, so allow that information to percolate through the
    driver core to userspace in order to make this happen. This means that
    some systems (i.e. Android and friends) will not need to even run a
    udev-like daemon for their device node manager and can just rely in
    devtmpfs fully, reducing their footprint even more.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

28 Feb, 2013

3 commits

  • Convert to the much saner new idr interface. Both bsg and genhd
    protect idr w/ mutex making preloading unnecessary.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • idr allocation in blk_alloc_devt() wasn't synchronized against lookup
    and removal, and its limit check was off by one - 1 << MINORBITS is
    the number of minors allowed, not the maximum allowed minor.

    Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
    checking.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • While adding and removing a lot of disks disks and partitions this
    sometimes shows up:

    WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
    Hardware name:
    sysfs: cannot create duplicate filename '/dev/block/259:751'
    Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
    Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
    Call Trace:
    warn_slowpath_common+0x87/0xc0
    warn_slowpath_fmt+0x46/0x50
    sysfs_add_one+0xc9/0x130
    sysfs_do_create_link+0x12b/0x170
    sysfs_create_link+0x13/0x20
    device_add+0x317/0x650
    idr_get_new+0x13/0x50
    add_partition+0x21c/0x390
    rescan_partitions+0x32b/0x470
    sd_open+0x81/0x1f0 [sd_mod]
    __blkdev_get+0x1b6/0x3c0
    blkdev_get+0x10/0x20
    register_disk+0x155/0x170
    add_disk+0xa6/0x160
    sd_probe_async+0x13b/0x210 [sd_mod]
    add_wait_queue+0x46/0x60
    async_thread+0x102/0x250
    default_wake_function+0x0/0x20
    async_thread+0x0/0x250
    kthread+0x96/0xa0
    child_rip+0xa/0x20
    kthread+0x0/0xa0
    child_rip+0x0/0x20

    This most likely happens because dev_t is freed while the number is
    still used and idr_get_new() is not protected on every use. The fix
    adds a mutex where it wasn't before and moves the dev_t free function so
    it is called after device del.

    Signed-off-by: Tomas Henzl
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomas Henzl
     

24 Feb, 2013

1 commit

  • Apply the introduced pm_runtime_set_memalloc_noio on block device so
    that PM core will teach mm to not allocate memory with GFP_IOFS when
    calling the runtime_resume and runtime_suspend callback for block
    devices and its ancestors.

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     

20 Dec, 2012

2 commits

  • Remove a race condition which causes a warning in disk_clear_events. This
    is a race between disk_clear_events() and disk_flush_events().
    ev->clearing will be altered by disk_flush_events() even though we are
    blocking event checking through disk_flush_events(). If this happens
    after ev->clearing was cleared for disk_clear_events(), this can cause the
    WARN_ON_ONCE() in that function to be triggered.

    This change also has disk_clear_events() not go through a workqueue.
    Since we have to wait for the work to complete, we should just call the
    function directly. Also, since this work cannot be put on a freezable
    workqueue, it will have to contend with increased demand, so calling the
    function directly avoids this.

    [akpm@linux-foundation.org: fix spello in comment]
    Signed-off-by: Derek Basehore
    Cc: Mandeep Singh Baines
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Derek Basehore
     
  • In disk_clear_events, do not put work on system_nrt_freezable_wq.
    Instead, put it on system_nrt_wq.

    There is a race between probing a usb and suspending the device. Since
    probing a usb calls disk_clear_events, which puts work on a frozen
    workqueue, probing cannot finish after the workqueue is frozen. However,
    suspending cannot finish until the usb probe is finished, so we get a
    deadlock, causing the system to reboot.

    The way to reproduce this bug is to wake up from suspend with a usb
    storage device plugged in, or plugging in a usb storage device right
    before suspend. The window of time is on the order of time it takes to
    probe the usb device. As long as the workqueues are frozen before the
    call to add_disk within sd_probe_async finishes, there will be a deadlock
    (which calls blkdev_get, sd_open, check_disk_change, then
    disk_clear_events). This is not difficult to reproduce after figuring out
    the timings.

    [akpm@linux-foundation.org: fix up comment]
    Signed-off-by: Derek Basehore
    Reviewed-by: Mandeep Singh Baines
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Derek Basehore
     

18 Dec, 2012

1 commit

  • Pull block driver update from Jens Axboe:
    "Now that the core bits are in, here are the driver bits for 3.8. The
    branch contains:

    - A huge pile of drbd bits that were dumped from the 3.7 merge
    window. Following that, it was both made perfectly clear that
    there is going to be no more over-the-wall pulls and how the
    situation on individual pulls can be improved.

    - A few cleanups from Akinobu Mita for drbd and cciss.

    - Queue improvement for loop from Lukas. This grew into adding a
    generic interface for waiting/checking an even with a specific
    lock, allowing this to be pulled out of md and now loop and drbd is
    also using it.

    - A few fixes for xen back/front block driver from Roger Pau Monne.

    - Partition improvements from Stephen Warren, allowing partiion UUID
    to be used as an identifier."

    * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
    drbd: update Kconfig to match current dependencies
    drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
    drbd: close race between drbd_set_role and drbd_connect
    drbd: respect no-md-barriers setting also when changed online via disk-options
    drbd: Remove obsolete check
    drbd: fixup after wait_even_lock_irq() addition to generic code
    loop: Limit the number of requests in the bio list
    wait: add wait_event_lock_irq() interface
    xen-blkfront: free allocated page
    xen-blkback: move free persistent grants code
    block: partition: msdos: provide UUIDs for partitions
    init: reduce PARTUUID min length to 1 from 36
    block: store partition_meta_info.uuid as a string
    cciss: use check_signature()
    cciss: cleanup bitops usage
    drbd: use copy_highpage
    drbd: if the replication link breaks during handshake, keep retrying
    drbd: check return of kmalloc in receive_uuids
    drbd: Broadcast sync progress no more often than once per second
    drbd: don't try to clear bits once the disk has failed
    ...

    Linus Torvalds
     

23 Nov, 2012

1 commit

  • This will allow other types of UUID to be stored here, aside from true
    UUIDs. This also simplifies code that uses this field, since it's usually
    constructed from a, used as a, or compared to other, strings.

    Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a
    /PARTNROFF option was found to be present. However, this modifies the
    input string, and causes subsequent calls to devt_from_partuuid() not to
    see the /PARTNROFF option, which causes different results. In order to
    avoid misleading future maintainers, this parameter is marked const.

    Signed-off-by: Stephen Warren
    Cc: Tejun Heo
    Cc: Will Drewry
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Stephen Warren
     

10 Nov, 2012

1 commit


03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]_work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

21 Aug, 2012

1 commit

  • system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
    convert all users to system[_freezable]_wq.

    If you're cc'd and wondering what's going on: Now all workqueues are
    non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
    Please use system[_freezable]_wq instead.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-By: Lai Jiangshan

    Cc: Jens Axboe
    Cc: David Airlie
    Cc: Jiri Kosina
    Cc: "David S. Miller"
    Cc: Rusty Russell
    Cc: "Paul E. McKenney"
    Cc: David Howells

    Tejun Heo
     

14 Aug, 2012

1 commit

  • Convert delayed_work users doing cancel_delayed_work() followed by
    queue_delayed_work() to mod_delayed_work().

    Most conversions are straight-forward. Ones worth mentioning are,

    * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
    use mod_delayed_work() and cancel loop in
    edac_mc_reset_delay_period() is dropped.

    * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
    watchdog is active or not. @fan_watchdog_active and related code
    dropped.

    * drivers/power/charger-manager.c: Seemingly a lot of
    delayed_work_pending() abuse going on here.
    [delayed_]work_pending() are unsynchronized and racy when used like
    this. I converted one instance in fullbatt_handler(). Please
    conver the rest so that it invokes workqueue APIs for the intended
    target state rather than trying to game work item pending state
    transitions. e.g. if timer should be modified - call
    mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

    * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
    simplified. Note that round_jiffies() calls in this function are
    meaningless. round_jiffies() work on absolute jiffies not delta
    delay used by delayed_work.

    v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work(). They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock. __cancel_delayed_work() users are
    dropped.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Acked-by: Dmitry Torokhov
    Acked-by: Anton Vorontsov
    Acked-by: David Howells
    Cc: Tomi Valkeinen
    Cc: Jens Axboe
    Cc: Jiri Kosina
    Cc: Doug Thompson
    Cc: David Airlie
    Cc: Roland Dreier
    Cc: "John W. Linville"
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: "J. Bruce Fields"
    Cc: Johannes Berg

    Tejun Heo
     

03 Aug, 2012

1 commit

  • I met a odd prblem:read /proc/partitions may return zero.

    I wrote a file test.c:
    int main()
    {
    char buff[4096];
    int ret;
    int fd;
    printf("pid=%d\n",getpid());
    while (1) {
    fd = open("/proc/partitions", O_RDONLY);
    if (fd < 0) {
    printf("open error %s\n", strerror(errno));
    return 0;
    }
    ret = read(fd, buff, 4096);
    if (ret /dev/null ;done
    2:./test

    I reviewed the code and found:

    >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
    >> {
    >> static void *p;
    >>
    >> p = disk_seqf_start(seqf, pos);
    >> if (!IS_ERR_OR_NULL(p) && !*pos)
    >> seq_puts(seqf, "major minor #blocks name\n\n");
    >> return p;
    >> }
    test cat /proc/partitions
    p = disk_seqf_start()(Not NULL)
    p = disk_seqf_start()(NULL because pos)
    if (!IS_ERR_OR_NULL(p) && !*pos)

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

01 Aug, 2012

1 commit

  • Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
    allows altering the size of an existing partition, even if it is currently
    in use.

    This patch converts hd_struct->nr_sects into sequence counter because
    One might extend a partition while IO is happening to it and update of
    nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
    can lead to issues like reading inconsistent size of a partition. Sequence
    counter have been used so that readers don't have to take bdev mutex lock
    as we call sector_in_part() very frequently.

    Now all the access to hd_struct->nr_sects should happen using sequence
    counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
    There is one exception though, set_capacity()/get_capacity(). I think
    theoritically race should exist there too but this patch does not
    modify set_capacity()/get_capacity() due to sheer number of call sites
    and I am afraid that change might break something. I have left that as a
    TODO item. We can handle it later if need be. This patch does not introduce
    any new races as such w.r.t set_capacity()/get_capacity().

    v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Phillip Susi
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

15 May, 2012

1 commit

  • 6d1d8050b4bc8 "block, partition: add partition_meta_info to hd_struct"
    added part_unpack_uuid() which assumes that the passed in buffer has
    enough space for sprintfing "%pU" - 37 characters including '\0'.

    Unfortunately, b5af921ec0233 "init: add support for root devices
    specified by partition UUID" supplied 33 bytes buffer to the function
    leading to the following panic with stackprotector enabled.

    Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e

    [] panic+0xba/0x1c6
    [] ? printk_all_partitions+0x259/0x26xb
    [] __stack_chk_fail+0x1b/0x20
    [] printk_all_paritions+0x259/0x26xb
    [] mount_block_root+0x1bc/0x27f
    [] mount_root+0x57/0x5b
    [] prepare_namespace+0x13d/0x176
    [] ? release_tgcred.isra.4+0x330/0x30
    [] kernel_init+0x155/0x15a
    [] ? schedule_tail+0x27/0xb0
    [] kernel_thread_helper+0x5/0x10
    [] ? start_kernel+0x3c5/0x3c5
    [] ? gs_change+0x13/0x13

    Increase the buffer size, remove the dangerous part_unpack_uuid() and
    use snprintf() directly from printk_all_partitions().

    Signed-off-by: Tejun Heo
    Reported-by: Szymon Gruszczynski
    Cc: Will Drewry
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Mar, 2012

2 commits

  • This patch (as1519) fixes a bug in the block layer's disk-events
    polling. The polling is done by a work routine queued on the
    system_nrt_wq workqueue. Since that workqueue isn't freezable, the
    polling continues even in the middle of a system sleep transition.

    Obviously, polling a suspended drive for media changes and such isn't
    a good thing to do; in the case of USB mass-storage devices it can
    lead to real problems requiring device resets and even re-enumeration.

    The patch fixes things by creating a new system-wide, non-reentrant,
    freezable workqueue and using it for disk-events polling.

    Signed-off-by: Alan Stern
    CC:
    Acked-by: Tejun Heo
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Jens Axboe

    Alan Stern
     
  • The following situation might occur:

    __blkdev_get: add_disk:

    register_disk()
    get_gendisk()

    disk_block_events()
    disk->ev == NULL

    disk_add_events()

    __disk_unblock_events()
    disk->ev != NULL
    --ev->block

    Then we unblock events, when they are suppose to be blocked. This can
    trigger events related block/genhd.c warnings, but also can crash in
    sd_check_events() or other places.

    I'm able to reproduce crashes with the following scripts (with
    connected usb dongle as sdb disk).

    DEV=/dev/sdb
    ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue

    function stop_me()
    {
    for i in `jobs -p` ; do kill $i 2> /dev/null ; done
    exit
    }

    trap stop_me SIGHUP SIGINT SIGTERM

    for ((i = 0; i < 10; i++)) ; do
    while true; do fdisk -l $DEV 2>&1 > /dev/null ; done &
    done

    while true ; do
    echo 1 > $ENABLE
    sleep 1
    echo 0 > $ENABLE
    done

    I use the script to verify patch fixing oops in sd_revalidate_disk
    http://marc.info/?l=linux-scsi&m=132935572512352&w=2
    Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in
    sd_revalidate_disk" or this one, script easily crash kernel within
    a few seconds. With both patches applied I do not observe crash.
    Unfortunately after some time (dozen of minutes), script will hung in:

    [ 1563.906432] [] schedule_timeout_uninterruptible+0x15/0x20
    [ 1563.906437] [] msleep+0x15/0x20
    [ 1563.906443] [] blk_drain_queue+0x32/0xd0
    [ 1563.906447] [] blk_cleanup_queue+0xd0/0x170
    [ 1563.906454] [] scsi_free_queue+0x3f/0x60
    [ 1563.906459] [] __scsi_remove_device+0x6e/0xb0
    [ 1563.906463] [] scsi_forget_host+0x4f/0x60
    [ 1563.906468] [] scsi_remove_host+0x5a/0xf0
    [ 1563.906482] [] quiesce_and_remove_host+0x5b/0xa0 [usb_storage]
    [ 1563.906490] [] usb_stor_disconnect+0x13/0x20 [usb_storage]

    Anyway I think this patch is some step forward.

    As drawback, I do not teardown on sysfs file create error, because I do
    not know how to nullify disk->ev (since it can be used). However add_disk
    error handling practically does not exist too, and things will work
    without this sysfs file, except events will not be exported to user
    space.

    Signed-off-by: Stanislaw Gruszka
    Acked-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Stanislaw Gruszka
     

16 Jan, 2012

1 commit

  • * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
    Revert "block: recursive merge requests"
    block: Stop using macro stubs for the bio data integrity calls
    blockdev: convert some macros to static inlines
    fs: remove unneeded plug in mpage_readpages()
    block: Add BLKROTATIONAL ioctl
    block: Introduce blk_set_stacking_limits function
    block: remove WARN_ON_ONCE() in exit_io_context()
    block: an exiting task should be allowed to create io_context
    block: ioc_cgroup_changed() needs to be exported
    block: recursive merge requests
    block, cfq: fix empty queue crash caused by request merge
    block, cfq: move icq creation and rq->elv.icq association to block core
    block, cfq: restructure io_cq creation path for io_context interface cleanup
    block, cfq: move io_cq exit/release to blk-ioc.c
    block, cfq: move icq cache management to block core
    block, cfq: move io_cq lookup to blk-ioc.c
    block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
    block, cfq: reorganize cfq_io_context into generic and cfq specific parts
    block: remove elevator_queue->ops
    block: reorder elevator switch sequence
    ...

    Fix up conflicts in:
    - block/blk-cgroup.c
    Switch from can_attach_task to can_attach
    - block/cfq-iosched.c
    conflict with now removed cic index changes (we now use q->id instead)

    Linus Torvalds
     

07 Jan, 2012

1 commit


04 Jan, 2012

3 commits


14 Dec, 2011

1 commit

  • * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
    failure instead of 0 / -errno or boolean. Update it such that it
    returns %true on success and %false on failure.

    * Make sure the caller checks for the return value.

    * Separate out __blk_get_queue() which doesn't check whether @q is
    dead and put it in blk.h. This will be used later.

    This patch doesn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Nov, 2011

1 commit

  • This reverts commit a72c5e5eb738033938ab30d6a634b74d1d060f10.

    The commit introduced alias for block devices which is intended to be
    used during logging although actual usage hasn't been committed yet.
    This approach adds very limited benefit (raw log might be easier to
    follow) which can be trivially implemented in userland but has a lot
    of problems.

    It is much worse than netif renames because it doesn't rename the
    actual device but just adds conveninence name which isn't used
    universally or enforced. Everything internal including device lookup
    and sysfs still uses the internal name and nothing prevents two
    devices from using conflicting alias - ie. sda can have sdb as its
    alias.

    This has been nacked by people working on device driver core, block
    layer and kernel-userland interface and shouldn't have been
    upstreamed. Revert it.

    http://thread.gmane.org/gmane.linux.kernel/1155104
    http://thread.gmane.org/gmane.linux.scsi/68632
    http://thread.gmane.org/gmane.linux.scsi/69776

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Acked-by: Kay Sievers
    Cc: "James E.J. Bottomley"
    Cc: Nao Nishijima
    Cc: Alan Cox
    Cc: Al Viro
    Signed-off-by: Jens Axboe

    Tejun Heo
     

05 Nov, 2011

2 commits

  • * 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
    virtio-blk: use ida to allocate disk index
    hpsa: add small delay when using PCI Power Management to reset for kump
    cciss: add small delay when using PCI Power Management to reset for kump
    xen/blkback: Fix two races in the handling of barrier requests.
    xen/blkback: Check for proper operation.
    xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
    xen/blkback: Report VBD_WSECT (wr_sect) properly.
    xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
    xen-blkfront: plug device number leak in xlblk_init() error path
    xen-blkfront: If no barrier or flush is supported, use invalid operation.
    xen-blkback: use kzalloc() in favor of kmalloc()+memset()
    xen-blkback: fixed indentation and comments
    xen-blkfront: fix a deadlock while handling discard response
    xen-blkfront: Handle discard requests.
    xen-blkback: Implement discard requests ('feature-discard')
    xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
    drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
    drivers/block/loop.c: emit uevent on auto release
    drivers/block/cpqarray.c: use pci_dev->revision
    loop: always allow userspace partitions and optionally support automatic scanning
    ...

    Fic up trivial header file includsion conflict in drivers/block/loop.c

    Linus Torvalds
     
  • * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

19 Oct, 2011

1 commit

  • The following command sequence triggers an oops.

    # mount /dev/sdb1 /mnt
    # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
    # umount /mnt

    general protection fault: 0000 [#1] PREEMPT SMP
    CPU 2
    Modules linked in:

    Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
    RIP: 0010:[] [] __lock_acquire+0x389/0x1d60
    ...
    Call Trace:
    [] lock_acquire+0x95/0x140
    [] _raw_spin_lock+0x3b/0x50
    [] bdi_lock_two+0x5c/0x70
    [] bdev_inode_switch_bdi+0x4c/0xf0
    [] __blkdev_put+0x11b/0x1d0
    [] __blkdev_put+0x160/0x1d0
    [] blkdev_put+0x5f/0x190
    [] kill_block_super+0x4d/0x80
    [] deactivate_locked_super+0x45/0x70
    [] deactivate_super+0x4a/0x70
    [] mntput_no_expire+0xed/0x130
    [] sys_umount+0x7e/0x3a0
    [] system_call_fastpath+0x16/0x1b

    This is because bdev holds on to disk but disk doesn't pin the
    associated queue. If a SCSI device is removed while the device is
    still open, the sdev puts the base reference to the queue on release.
    When the bdev is finally released, the associated queue is already
    gone along with the bdi and bdev_inode_switch_bdi() ends up
    dereferencing already freed bdi.

    Even if it were not for this bug, disk not holding onto the associated
    queue is very unusual and error-prone.

    Fix it by making add_disk() take an extra reference to its queue and
    put it on disk_release() and ensuring that disk and its fops owner are
    put in that order after all accesses to the disk and queue are
    complete.

    Signed-off-by: Tejun Heo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

29 Aug, 2011

1 commit

  • This patch allows the user to set an "alias" of the disk via sysfs interface.

    This patch only adds a new attribute "alias" in gendisk structure.
    To show the alias instead of the device name in kernel messages,
    we need to revise printk messages and use alias_name() in them.

    Example:
    (current) printk("disk name is %s\n", disk->disk_name);
    (new) printk("disk name is %s\n", alias_name(disk));

    Users can use alphabets, numbers, '-' and '_' in "alias" attribute. A disk can
    have an "alias" which length is up to 255 bytes. This attribute is write-once.

    Suggested-by: James Bottomley
    Suggested-by: Jon Masters
    Signed-off-by: Nao Nishijima
    Signed-off-by: James Bottomley

    Nao Nishijima
     

24 Aug, 2011

1 commit

  • There are cases where suppressing partition scan is useful - e.g. for
    lo devices and pseudo SATA devices which advertise to be a disk but
    get upset on partition scan (some port multiplier control devices show
    such behavior).

    This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
    regardless of the number of possible partitions. disk_partitionable()
    is renamed to disk_part_scan_enabled() as suppressing partition scan
    doesn't imply the device can't be partitioned using
    BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
    directly tests disk_max_parts() to maintain backward-compatibility.

    -v2: Updated to make it clear that only partition scan is suppressed
    not partitioning itself as suggested by Kay Sievers.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Aug, 2011

1 commit

  • Remove the (unsigned long long) cast in diskstats_show() and adjusts the
    seq_printf() format string to 'unsigned long'

    diskstats_show() uses part_stat_read() to get the stats, which either
    accesses the specified field in the struct disk_stats directly (non SMP)
    or sums up the per CPU values in a variable of the same type as the field,
    so in any case the result will have the same type and range as the
    specified field which for all disk_stats entries is unsigned long

    Also, for unsigned long ranges the output of %lu should be identical to
    the one of %llu, so no change in the actual proc entry contents.

    Signed-off-by: Herbert Poetzl
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Herbert Poetzl