07 Mar, 2015

3 commits

  • commit 045c47ca306acf30c740c285a77a4b4bda6be7c5 upstream.

    When reading blkio.throttle.io_serviced in a recently created blkio
    cgroup, it's possible to race against the creation of a throttle policy,
    which delays the allocation of stats_cpu.

    Like other functions in the throttle code, just checking for a NULL
    stats_cpu prevents the following oops caused by that race.

    [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
    [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
    [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
    [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
    [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
    [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
    [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
    [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300 Not tainted (3.19.0)
    [ 1137.734230] MSR: 9000000000009032 CR: 42008884 XER: 20000000
    [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
    GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
    GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
    GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
    GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
    GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
    GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
    [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
    [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
    [ 1137.734943] Call Trace:
    [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
    [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
    [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
    [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
    [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
    [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
    [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
    [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
    [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
    [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
    [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
    [ 1137.735383] Instruction dump:
    [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
    [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 e9090008 e9490010 e9290018
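
    The fix itself is small: like the other helpers in the throttle code,
    tg_prfill_cpu_rwstat() bails out early while stats_cpu is still
    unallocated. A minimal sketch of the guard (the exact upstream hunk may
    differ in detail):

    static u64 tg_prfill_cpu_rwstat(struct seq_file *sf,
                                    struct blkg_policy_data *pd, int off)
    {
            struct throtl_grp *tg = pd_to_tg(pd);

            /* stats_cpu may not be allocated yet if we raced with the
             * creation of the throttle policy; report nothing then */
            if (tg->stats_cpu == NULL)
                    return 0;
            ...
    }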

    Here is a test program that makes the race easy to reproduce, although
    the problem was first found by running docker. (The includes and macro
    values in the listing below are illustrative additions; the original
    report did not include them.)

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* illustrative values; adjust CGPATH to your blkio cgroup mount */
    #define CGPATH       "/sys/fs/cgroup/blkio"
    #define BUFFER_ALIGN 4096
    #define BUFFER_SIZE  4096
    #define NR_TESTS     10000

    void run(pid_t pid)
    {
        int n;
        int status;
        int fd;
        char *buffer;

        buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
        n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
        fd = open(CGPATH "/test/tasks", O_WRONLY);
        write(fd, buffer, n);
        close(fd);

        if (fork() > 0) {
            /* direct I/O in the fresh cgroup races the creation of
             * the throttle policy (and thus of stats_cpu) */
            fd = open("/dev/sda", O_RDONLY | O_DIRECT);
            read(fd, buffer, 512);
            close(fd);
            wait(&status);
        } else {
            /* concurrent reader of the per-cpu stats */
            fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
            n = read(fd, buffer, BUFFER_SIZE);
            close(fd);
        }

        free(buffer);
        exit(0);
    }

    void test(void)
    {
        int status;

        mkdir(CGPATH "/test", 0666);
        if (fork() > 0)
            wait(&status);
        else
            run(getpid());
        rmdir(CGPATH "/test");
    }

    int main(int argc, char **argv)
    {
        int i;

        for (i = 0; i < NR_TESTS; i++)
            test();
        return 0;
    }

    Reported-by: Ricardo Marin Matinata
    Signed-off-by: Thadeu Lima de Souza Cascardo
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Thadeu Lima de Souza Cascardo
     
  • commit c6ce194325cef342313e3d27620411ce90a89c50 upstream.

    Hi,

    If you can manage to submit an async write as the first async I/O from
    the context of a process with realtime scheduling priority, then a
    cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It
    ends up in the best effort array, but actually has realtime I/O
    scheduling priority set in cfqq->ioprio.

    The reason is that cfq_get_queue assumes the default scheduling class and
    priority when there is no information present (i.e. when the async cfqq
    is created):

    static struct cfq_queue *
    cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
                  struct bio *bio, gfp_t gfp_mask)
    {
            const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
            const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);

    cic->ioprio starts out as 0, which is "invalid". So, a class of 0
    (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio() like so:

    async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);

    static struct cfq_queue **
    cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
    {
            switch (ioprio_class) {
            case IOPRIO_CLASS_RT:
                    return &cfqd->async_cfqq[0][ioprio];
            case IOPRIO_CLASS_NONE:
                    ioprio = IOPRIO_NORM;
                    /* fall through */
            case IOPRIO_CLASS_BE:
                    return &cfqd->async_cfqq[1][ioprio];
            case IOPRIO_CLASS_IDLE:
                    return &cfqd->async_idle_cfqq;
            default:
                    BUG();
            }
    }

    Here, instead of returning a class mapped from the process' scheduling
    priority, we get back the bucket associated with IOPRIO_CLASS_BE.

    Now, there is no queue allocated there yet, so we create it:

    cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);

    That function ends up doing this:

    cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
    cfq_init_prio_data(cfqq, cic);

    cfq_init_cfqq() marks the priority as having changed. Then,
    cfq_init_prio_data() does this:

    ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
    switch (ioprio_class) {
    default:
            printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
    case IOPRIO_CLASS_NONE:
            /*
             * no prio set, inherit CPU scheduling settings
             */
            cfqq->ioprio = task_nice_ioprio(tsk);
            cfqq->ioprio_class = task_nice_ioclass(tsk);
            break;

    So we basically have two code paths that treat IOPRIO_CLASS_NONE
    differently, which results in an RT async cfqq filed into a best effort
    bucket.

    Attached is a patch which fixes the problem. I'm not sure how to make
    it cleaner. Suggestions would be welcome.
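
    The approach, roughly: when the async cfqq is looked up and cic->ioprio
    is still unset, derive the class and priority from the task's CPU
    scheduling settings first, mirroring what cfq_init_prio_data() already
    does (a trimmed sketch of the patch; the two const qualifiers on the
    locals in cfq_get_queue() have to go for this):

    if (!is_sync) {
            if (!ioprio_valid(cic->ioprio)) {
                    struct task_struct *tsk = current;

                    ioprio = task_nice_ioprio(tsk);
                    ioprio_class = task_nice_ioclass(tsk);
            }
            async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
            ...
    }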

    Signed-off-by: Jeff Moyer
    Tested-by: Hidehiro Kawai
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     
  • commit 69abaffec7d47a083739b79e3066cb3730eba72e upstream.

    cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC.
    In cfq_find_alloc_queue() a possible allocation failure is not handled.
    As a result the kernel oopses on a NULL pointer dereference when
    cfq_link_cfqq_cfqg() calls cfqg_get() on a NULL pointer.

    The bug was introduced in v3.5 by commit cd1604fab4f9 ("blkcg: factor
    out blkio_group creation"). Prior to that commit, the cfq group lookup
    returned a pointer to the root group as a fallback.

    This patch handles this error using existing fallback oom_cfqq.
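
    A sketch of the fallback as applied in cfq_find_alloc_queue():

    cfqg = cfq_lookup_create_cfqg(cfqd, blkcg);
    if (!cfqg) {
            /* the GFP_ATOMIC allocation failed; fall back to the
             * statically allocated oom queue instead of oopsing */
            cfqq = &cfqd->oom_cfqq;
            goto out;
    }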

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Tejun Heo
    Acked-by: Vivek Goyal
    Fixes: cd1604fab4f9 ("blkcg: factor out blkio_group creation")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

16 Jan, 2015

2 commits

  • commit 5fabcb4c33fe11c7e3afdf805fde26c1a54d0953 upstream.

    We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
    with a user-passed partno value. If we pass in 0x7fffffff, the
    new target in disk_expand_part_tbl() overflows the 'int' and we
    access beyond the end of ptbl->part[] and even write to it when we
    do the rcu_assign_pointer() to assign the new partition.
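
    A sketch of the kind of check added in disk_expand_part_tbl(), where
    target is computed as partno + 1:

    /*
     * check for int overflow, since we can get here from blkpg_ioctl()
     * with a user passed 'partno'
     */
    target = partno + 1;
    if (target < 0)
            return -EINVAL;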

    Reported-by: David Ramos
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit a33c1ba2913802b6fb23e974bb2f6a4e73c8b7ce upstream.

    We currently use num_possible_cpus(), but that breaks on sparc64, where
    the CPU ID space is discontiguous. Use nr_cpu_ids as the highest
    possible CPU ID instead, so we don't end up reading from invalid memory.
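
    The pattern, illustrated on the hwq <-> cpu map allocation (a sketch;
    the variable names here are illustrative):

    /* size by the highest possible CPU ID, not the number of CPUs,
     * since the ID space may be sparse (e.g. on sparc64) */
    map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL, node);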

    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

15 Nov, 2014

2 commits

  • commit 84ce0f0e94ac97217398b3b69c21c7a62ebeed05 upstream.

    When sg_scsi_ioctl() fails to prepare a request to submit in
    blk_rq_map_kern(), we jump to a label where we just end up copying a
    (luckily zeroed-out) kernel buffer to userspace instead of reporting
    the error. Fix the problem by jumping to the right label.
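
    A sketch of the control-flow change in sg_scsi_ioctl():

    err = blk_rq_map_kern(q, rq, buffer, bytes, __GFP_WAIT);
    if (err)
            goto error;     /* was "goto out", which copied the untouched
                               kernel buffer back to userspace */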

    CC: Jens Axboe
    CC: linux-scsi@vger.kernel.org
    Coverity-id: 1226871
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Fixed up the, now unused, out label.

    Signed-off-by: Jens Axboe

    Jan Kara
     
  • commit b8839b8c55f3fdd60dc36abcda7e0266aff7985c upstream.

    The math in both blk_stack_limits() and queue_limit_alignment_offset()
    assumes that a block device's io_min (aka minimum_io_size) is always a
    power of 2. Fix the math so that it also works for a non-power-of-2
    io_min.

    This issue (of alignment_offset != 0) became apparent when testing
    dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
    1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
    block size") unlocked the potential for alignment_offset != 0 due to
    the dm-thin-pool's io_min possibly being a non-power-of-2.
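
    The gist of the change: compute the alignment with a real modulo
    instead of a power-of-2 mask. A sketch for
    queue_limit_alignment_offset(), assuming granularity/sector variable
    names (the upstream hunk differs in detail):

    /* before: only correct when granularity is a power of 2 */
    alignment = (sector << 9) & (granularity - 1);

    /* after: correct for any granularity */
    alignment = sector_div(sector, granularity >> 9) << 9;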

    Signed-off-by: Mike Snitzer
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

06 Oct, 2014

4 commits

  • commit d97a86c170b4e432f76db072a827fe30b4d6f659 upstream.

    The lvip[] array has "state->limit" elements so the condition here
    should be >= instead of >.

    Fixes: 6ceea22bbbc8 ('partitions: add aix lvm partition support files')
    Signed-off-by: Dan Carpenter
    Acked-by: Philippe De Muyter
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit 46f341ffcfb5d8530f7d1e60f3be06cce6661b62 upstream.

    Commit 2da78092 changed the locking from a mutex to a spinlock, so we
    no longer sleep in this context. But there was a leftover might_sleep()
    in there, which now triggers since we do the final free from an RCU
    callback. Get rid of it.

    Reported-by: Pontus Fuchs
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit 2da78092dda13f1efd26edbbf99a567776913750 upstream.

    Releases the dev_t minor when all references are closed to prevent
    another device from acquiring the same major/minor.

    Since the partition's release may be invoked from call_rcu's soft-irq
    context, the ext_dev_idr's mutex had to be replaced with a spinlock so
    as not to sleep.
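
    A sketch of the resulting release path (assuming the upstream helper
    names blk_free_devt() and blk_mangle_minor()):

    void blk_free_devt(dev_t devt)
    {
            if (devt == MKDEV(0, 0))
                    return;

            if (MAJOR(devt) == BLOCK_EXT_MAJOR) {
                    /* a spinlock, not a mutex: this can run from
                     * call_rcu's soft-irq context and must not sleep */
                    spin_lock(&ext_devt_lock);
                    idr_remove(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
                    spin_unlock(&ext_devt_lock);
            }
    }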

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • commit e15693ef18e13e3e6bffe891fe140f18b8ff6d07 upstream.

    cfq_group_service_tree_add() applies new_weight at the beginning of the
    function via cfq_update_group_weight(). This allows the weight to
    change between being added to and subtracted from children_weight,
    which triggers the WARN_ON_ONCE() in cfq_group_service_tree_del(), or
    even causes an oops via a divide error during vfr calculation in
    cfq_group_service_tree_add().

    The detailed scenario is as follows:
    1. Create blkio cgroups X and Y as a child of X.
       Set X's weight to 500 and perform some I/O to apply new_weight.
       This X's I/O completes before starting Y's I/O.
    2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
    3. cfq_group_service_tree_add() walks up the tree during
       children_weight calculation and adds parent X's weight (500) to
       children_weight of root. children_weight becomes 500.
    4. Set X's weight to 1000.
    5. X starts I/O and cfq_group_service_tree_add() is called with X.
    6. cfq_group_service_tree_add() applies its new_weight (1000).
    7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
    8. I/O of X completes and cfq_group_service_tree_del() is called with X.
    9. cfq_group_service_tree_del() subtracts X's weight (1000) from
       children_weight of root. children_weight becomes -500.
       This triggers WARN_ON_ONCE().
    10. Set X's weight to 500.
    11. X starts I/O and cfq_group_service_tree_add() is called with X.
    12. cfq_group_service_tree_add() applies its new_weight (500) and adds
        it to children_weight of root. children_weight becomes 0.
        Calculation of vfr then triggers an oops via a divide error.

    The weight should therefore be updated right before adding it to
    children_weight.
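
    A sketch of the reordering (hedged; the upstream patch also reshuffles
    the weight-update helpers):

    static void
    cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
    {
            ...
            /* moved from the top of the function: apply a pending
             * new_weight only here, immediately before the weight is
             * added to children_weight, so that the del path later
             * subtracts the value that was actually added */
            cfq_update_group_weight(cfqg);
            __cfq_group_service_tree_add(st, cfqg);
            ...
    }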

    Reported-by: Ruki Sekiya
    Signed-off-by: Toshiaki Makita
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Toshiaki Makita
     

01 Aug, 2014

3 commits

  • commit 0b462c89e31f7eb6789713437eb551833ee16ff3 upstream.

    While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping the call into policy draining if all the blkgs are
    already gone.
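
    A sketch of the guard, assuming it lands in blkcg_drain_queue():

    void blkcg_drain_queue(struct request_queue *q)
    {
            lockdep_assert_held(q->queue_lock);

            /*
             * @q could be exiting and already have destroyed all blkgs,
             * as indicated by a NULL root_blkg. If that is the case,
             * there is nothing left for the policies to drain.
             */
            if (!q->root_blkg)
                    return;

            blk_throtl_drain(q);
    }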

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit d45b3279a5a2252cafcd665bbf2db8c9b31ef783 upstream.

    There is no inherent reason why the last put of a tag structure must be
    the one for the Scsi_Host, as device model objects can be held for
    arbitrary periods. Merge blk_free_tags and __blk_free_tags into a
    single function that just releases a reference, and get rid of the
    BUG() for the case where the host reference wasn't the last one.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 3b3a1814d1703027f9867d0f5cbbfaf6c7482474 upstream.

    This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer
    to two uint64_t values, so there is no need to translate it.
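
    A sketch of the added case in compat_blkdev_ioctl():

    case BLKZEROOUT:
            /* the argument points at two uint64_t values, which have
             * the same layout for 32-bit and 64-bit callers, so the
             * command can be passed straight through */
            return blkdev_ioctl(bdev, mode, cmd,
                                (unsigned long)compat_ptr(arg));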

    Signed-off-by: Mikulas Patocka
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

10 Jul, 2014

1 commit

  • commit a5049a8ae34950249a7ae94c385d7c5c98914412 upstream.

    Hello,

    So, this patch should do. Joe, Vivek, can one of you guys please
    verify that the oops goes away with this patch?

    Jens, the original thread can be read at

    http://thread.gmane.org/gmane.linux.kernel/1720729

    The fix converts blkg->refcnt from int to atomic_t. It adds some
    overhead, but that should be minute compared to everything else that is
    going on and the cacheline bouncing involved, so I think it's highly
    unlikely to cause any noticeable difference. Also, the refcnt in
    question should be converted to a percpu_ref for blk-mq anyway, so the
    atomic_t is likely to go away pretty soon.

    Thanks.

    ------- 8< -------
    __blkg_release_rcu() may be invoked after the associated request_queue
    is released, with an RCU grace period in between. As such, the function
    and callbacks invoked from it must not dereference the associated
    request_queue. This is clearly indicated in the comment above the
    function.

    Unfortunately, while trying to fix a different issue, 2a4fd070ee85
    ("blkcg: move bulk of blkcg_gq release operations to the RCU
    callback") ignored this and added [un]locking of @blkg->q->queue_lock
    to __blkg_release_rcu(). This of course can cause an oops, as the
    request_queue may be long gone by the time this code gets executed.

    general protection fault: 0000 [#1] SMP
    CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
    Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
    task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
    RIP: 0010:[] [] _raw_spin_lock_irq+0x15/0x60
    RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086
    RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
    RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
    RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
    R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
    R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
    Stack:
    ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
    ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
    ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
    Call Trace:
    [] __blkg_release_rcu+0x72/0x150
    [] rcu_nocb_kthread+0x1e8/0x300
    [] kthread+0xe1/0x100
    [] ret_from_fork+0x7c/0xb0
    Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
    +fa 66 66 90 66 66 90 b8 00 00 02 00 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
    +b7
    RIP [] _raw_spin_lock_irq+0x15/0x60
    RSP

    The request_queue locking was added because blkcg_gq->refcnt is an int
    protected with the queue lock and __blkg_release_rcu() needs to put
    the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and
    dropping queue locking in the function.

    Given the general heavy weight of the current request_queue and blkcg
    operations, this is unlikely to cause any noticeable overhead.
    Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
    the near future, so whatever (most likely negligible) overhead it may
    add is temporary.
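
    A sketch of the refcounting after the conversion; note that no
    queue_lock is needed on either side:

    static inline void blkg_get(struct blkcg_gq *blkg)
    {
            WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
            atomic_inc(&blkg->refcnt);
    }

    static inline void blkg_put(struct blkcg_gq *blkg)
    {
            WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
            if (atomic_dec_and_test(&blkg->refcnt))
                    call_rcu(&blkg->rcu_head, __blkg_release_rcu);
    }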

    Signed-off-by: Tejun Heo
    Reported-by: Joe Lawrence
    Acked-by: Vivek Goyal
    Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

01 Jun, 2014

1 commit

  • commit af5040da01ef980670b3741b3e10733ee3e33566 upstream.

    trace_block_rq_complete does not take into account that a request can
    be partially completed, so we can get the following incorrect output
    from blkparse:

    C R 232 + 240 [0]
    C R 240 + 232 [0]
    C R 248 + 224 [0]
    C R 256 + 216 [0]

    but should be:

    C R 232 + 8 [0]
    C R 240 + 8 [0]
    C R 248 + 8 [0]
    C R 256 + 8 [0]

    Also, the overall summary statistics of completed requests and the
    final throughput in the output will be incorrect.

    This patch takes into account real completion size of the request and
    fixes wrong completion accounting.
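
    The gist (a sketch): blk_update_request() passes the number of bytes
    actually completed down to the tracepoint, instead of letting the
    tracepoint read the length from the still-live request:

    /* in blk_update_request(); nr_bytes is the completed chunk */
    trace_block_rq_complete(req->q, req, nr_bytes);

    /* the tracepoint then records nr_bytes >> 9 as the I/O size */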

    Signed-off-by: Roman Pen
    CC: Steven Rostedt
    CC: Frederic Weisbecker
    CC: Ingo Molnar
    CC: linux-kernel@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Roman Pen
     

09 Mar, 2014

2 commits

  • Commit 18741986 inadvertently changed the rq flush insertion
    from a head to a tail insertion. Fix that back up.
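
    A sketch of the distinction in the flush dispatch path:

    /* flush requests must go to the front of the dispatch queue */
    list_add(&rq->queuelist, &q->queue_head);       /* not list_add_tail() */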

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • Commit 1874198 ("blk-mq: rework flush sequencing logic") switched
    ->flush_rq from being an embedded member of the request_queue structure
    to being dynamically allocated in blk_init_queue_node().

    Request-based DM multipath doesn't use blk_init_queue_node(); instead
    it uses blk_alloc_queue_node() + blk_init_allocated_queue(). Because
    commit 1874198 placed the dynamic allocation of ->flush_rq in
    blk_init_queue_node(), any flush issued to a dm-mpath device would
    crash with a NULL pointer, e.g.:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] blk_rq_init+0x1e/0xb0
    PGD bb3c7067 PUD bb01d067 PMD 0
    Oops: 0002 [#1] SMP
    ...
    CPU: 5 PID: 5028 Comm: dt Tainted: G W O 3.14.0-rc3.snitm+ #10
    ...
    task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000
    RIP: 0010:[] [] blk_rq_init+0x1e/0xb0
    RSP: 0018:ffff880079565c98 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030
    RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001
    R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000
    R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000
    FS: 00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0
    Stack:
    0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47
    ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000
    ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8
    Call Trace:
    [] blk_flush_complete_seq+0x2d7/0x2e0
    [] blk_insert_flush+0x1dd/0x210
    [] __elv_add_request+0x1f9/0x320
    [] ? blk_account_io_start+0x111/0x190
    [] blk_queue_bio+0x25b/0x330
    [] dm_request+0x35/0x40 [dm_mod]
    [] generic_make_request+0xc0/0x100
    [] submit_bio+0x73/0x140
    [] submit_bio_wait+0x5d/0x80
    [] blkdev_issue_flush+0x78/0xa0
    [] blkdev_fsync+0x3f/0x60
    [] vfs_fsync_range+0x1e/0x20
    [] vfs_fsync+0x1c/0x20
    [] do_fsync+0x41/0x80
    [] ? SyS_lseek+0x7e/0x80
    [] SyS_fsync+0x10/0x20
    [] system_call_fastpath+0x16/0x1b

    Fix this by moving the ->flush_rq allocation from blk_init_queue_node()
    to blk_init_allocated_queue(). blk_init_queue_node() also calls
    blk_init_allocated_queue(), so this change is functionally equivalent
    for all blk_init_queue_node() callers.
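
    A sketch of the allocation in its new home:

    struct request_queue *
    blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
                             spinlock_t *lock)
    {
            if (!q)
                    return NULL;

            /* moved here from blk_init_queue_node() so that DM's
             * blk_alloc_queue_node() + blk_init_allocated_queue()
             * path also gets a flush_rq */
            q->flush_rq = kzalloc(sizeof(struct request), GFP_KERNEL);
            if (!q->flush_rq)
                    return NULL;
            ...
    }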

    Reported-by: Hannes Reinecke
    Reported-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

04 Mar, 2014

1 commit

  • [ 365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674
    [ 365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1
    [ 365.164043] no locks held by migration/1/26.
    [ 365.164044] irq event stamp: 6648
    [ 365.164056] hardirqs last enabled at (6647): [] restore_args+0x0/0x30
    [ 365.164062] hardirqs last disabled at (6648): [] multi_cpu_stop+0x9d/0x120
    [ 365.164070] softirqs last enabled at (0): [] copy_process.part.28+0x6fc/0x1920
    [ 365.164072] softirqs last disabled at (0): [< (null)>] (null)
    [ 365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF N 3.12.12-rt19-0.gcb6c4a2-rt #3
    [ 365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011
    [ 365.164091] 0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0
    [ 365.164099] ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f
    [ 365.164107] ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0
    [ 365.164108] Call Trace:
    [ 365.164119] [] try_stack_unwind+0x191/0x1a0
    [ 365.164127] [] dump_trace+0x92/0x360
    [ 365.164133] [] show_trace_log_lvl+0x48/0x60
    [ 365.164138] [] show_stack_log_lvl+0xd8/0x1d0
    [ 365.164143] [] show_stack+0x20/0x50
    [ 365.164153] [] dump_stack+0x54/0x9a
    [ 365.164163] [] __might_sleep+0xfc/0x140
    [ 365.164173] [] rt_spin_lock+0x1f/0x70
    [ 365.164182] [] blk_mq_main_cpu_notify+0x20/0x70
    [ 365.164191] [] notifier_call_chain+0x4c/0x70
    [ 365.164201] [] __raw_notifier_call_chain+0x9/0x10
    [ 365.164207] [] cpu_notify+0x1e/0x40
    [ 365.164217] [] take_cpu_down+0x22/0x40
    [ 365.164223] [] multi_cpu_stop+0xd6/0x120
    [ 365.164229] [] cpu_stopper_thread+0xd7/0x1e0
    [ 365.164235] [] smpboot_thread_fn+0x203/0x380
    [ 365.164241] [] kthread+0xc8/0xd0
    [ 365.164250] [] ret_from_fork+0x7c/0xb0
    [ 365.164429] smpboot: CPU 1 is now offline

    Signed-off-by: Mike Galbraith
    Signed-off-by: Jens Axboe

    Mike Galbraith
     

15 Feb, 2014

1 commit

  • Pull block IO fixes from Jens Axboe:
    "Second round of updates and fixes for 3.14-rc2. Most of this stuff
    has been queued up for a while. The notable exception is the blk-mq
    changes, which are naturally a bit more in flux still.

    The pull request contains:

    - Two bug fixes for the new immutable vecs, causing crashes with raid
    or swap. From Kent.

    - Various blk-mq tweaks and fixes from Christoph. A fix for
    integrity bio's from Nic.

    - A few bcache fixes from Kent and Darrick Wong.

    - xen-blk{front,back} fixes from David Vrabel, Matt Rushton, Nicolas
    Swenson, and Roger Pau Monne.

    - Fix for a vec miscount with integrity vectors from Martin.

    - Minor annotations or fixes from Masanari Iida and Rashika Kheria.

    - Tweak to null_blk to do more normal FIFO processing of requests
    from Shlomo Pongratz.

    - Elevator switching bypass fix from Tejun.

    - Softlockup in blkdev_issue_discard() fix when !CONFIG_PREEMPT from
    me"

    * 'for-linus' of git://git.kernel.dk/linux-block: (31 commits)
    block: add cond_resched() to potentially long running ioctl discard loop
    xen-blkback: init persistent_purge_work work_struct
    blk-mq: pair blk_mq_start_request / blk_mq_requeue_request
    blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq
    block: Fix cloning of discard/write same bios
    block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show
    blk-mq: rework flush sequencing logic
    null_blk: use blk_complete_request and blk_mq_complete_request
    virtio_blk: use blk_mq_complete_request
    blk-mq: rework I/O completions
    fs: Add prototype declaration to appropriate header file include/linux/bio.h
    fs: Mark function as static in fs/bio-integrity.c
    block/null_blk: Fix completion processing from LIFO to FIFO
    block: Explicitly handle discard/write same segments
    block: Fix nr_vecs for inline integrity vectors
    blk-mq: Add bio_integrity setup to blk_mq_make_request
    blk-mq: initialize sg_reserved_size
    blk-mq: handle dma_drain_size
    blk-mq: divert __blk_put_request for MQ ops
    blk-mq: support at_head inserations for blk_execute_rq
    ...

    Linus Torvalds
     

13 Feb, 2014

1 commit

  • When mkfs issues a full device discard and the device only
    supports discards of a smallish size, we can loop in
    blkdev_issue_discard() for a long time. If preempt isn't enabled,
    this can turn into a softlock situation and the kernel will
    start complaining.

    Add an explicit cond_resched() at the end of the loop to avoid
    that.
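
    A sketch of the blkdev_issue_discard() loop with the added reschedule
    point:

    while (nr_sects) {
            /* ... build and submit one discard bio of at most
             * max_discard_sectors ... */

            /*
             * We can loop for a long time in here if someone does
             * full device discards (like mkfs). Be nice and allow
             * us to schedule out to avoid softlocking if preempt
             * isn't enabled.
             */
            cond_resched();
    }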

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Feb, 2014

2 commits

    Make sure we have proper pairing between starting and requeueing
    requests. Move the dma drain and REQ_END setup into
    blk_mq_start_request, and make sure blk_mq_requeue_request properly
    undoes them, giving us a pair of functions to prepare and unprepare a
    request without leaving side effects.

    Together this ensures we always clean up properly after
    BLK_MQ_RQ_QUEUE_BUSY returns from ->queue_rq.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    rq->errors has never been part of the communication protocol between
    drivers and the block stack, and most drivers will not have initialized
    it.

    Return -EIO to upper layers unconditionally when the driver returns
    BLK_MQ_RQ_QUEUE_ERROR. If a driver wants to return a different error it
    can easily do so by returning success after calling blk_mq_end_io
    itself.
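
    A sketch of the dispatch-side handling after the change:

    case BLK_MQ_RQ_QUEUE_ERROR:
            /* don't trust whatever the driver left in rq->errors */
            rq->errors = -EIO;
            blk_mq_end_io(rq, rq->errors);
            break;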

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Feb, 2014

3 commits

    cppcheck detected the following format string mismatch:
    [blk-mq-tag.c:201]: (warning) %u in format string (no. 1) requires
    'unsigned int' but the argument type is 'int'.

    Change "cpu" from int to unsigned int, because the cpu number can
    never be negative.

    Signed-off-by: Masanari Iida
    Signed-off-by: Jens Axboe

    Masanari Iida
     
    Switch to using a preallocated flush_rq for blk-mq, similar to what's
    done with the old request path. This allows us to set up the request
    properly with a tag from the actually allowed range and ->rq_disk as
    needed by some drivers. To make life easier we also switch to dynamic
    allocation of ->flush_rq for the old path.

    This effectively reverts most of

    "blk-mq: fix for flush deadlock"

    and

    "blk-mq: Don't reserve a tag for flush request"

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Rework I/O completions to work more like the old code path.
    blk_mq_end_io now stays out of the business of deferring completions to
    other CPUs and calling blk_mark_rq_complete. The latter is very
    important to allow completing requests that have timed out and thus are
    already marked completed; the former allows using the IPI callout even
    for driver-specific completions instead of having to reimplement them.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Feb, 2014

1 commit

  • Pull SCSI target updates from Nicholas Bellinger:
    "The highlights this round include:

    - add support for SCSI Referrals (Hannes)
    - add support for T10 DIF into target core (nab + mkp)
    - add support for T10 DIF emulation in FILEIO + RAMDISK backends (Sagi + nab)
    - add support for T10 DIF -> bio_integrity passthrough in IBLOCK backend (nab)
    - prep changes to iser-target for >= v3.15 T10 DIF support (Sagi)
    - add support for qla2xxx N_Port ID Virtualization - NPIV (Saurav + Quinn)
    - allow percpu_ida_alloc() to receive task state bitmask (Kent)
    - fix >= v3.12 iscsi-target session reset hung task regression (nab)
    - fix >= v3.13 percpu_ref se_lun->lun_ref_active race (nab)
    - fix a long-standing network portal creation race (Andy)"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (51 commits)
    target: Fix percpu_ref_put race in transport_lun_remove_cmd
    target/iscsi: Fix network portal creation race
    target: Report bad sector in sense data for DIF errors
    iscsi-target: Convert gfp_t parameter to task state bitmask
    iscsi-target: Fix connection reset hang with percpu_ida_alloc
    percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
    iscsi-target: Pre-allocate more tags to avoid ack starvation
    qla2xxx: Configure NPIV fc_vport via tcm_qla2xxx_npiv_make_lport
    qla2xxx: Enhancements to enable NPIV support for QLOGIC ISPs with TCM/LIO.
    qla2xxx: Fix scsi_host leak on qlt_lport_register callback failure
    IB/isert: pass scatterlist instead of cmd to fast_reg_mr routine
    IB/isert: Move fastreg descriptor creation to a function
    IB/isert: Avoid frwr notation, user fastreg
    IB/isert: seperate connection protection domains and dma MRs
    tcm_loop: Enable DIF/DIX modes in SCSI host LLD
    target/rd: Add DIF protection into rd_execute_rw
    target/rd: Add support for protection SGL setup + release
    target/rd: Refactor rd_build_device_space + rd_release_device_space
    target/file: Add DIF protection support to fd_execute_rw
    target/file: Add DIF protection init/format support
    ...

    Linus Torvalds
     

31 Jan, 2014

2 commits

    request_queue bypassing is used to suppress the higher-level functions
    of a request_queue so that they can be switched, reconfigured and shut
    down. A request_queue does the following while bypassing.

    * bypasses elevator and io_cq association and queues requests directly
    to the FIFO dispatch queue.

    * bypasses block cgroup request_list lookup and always uses the root
    request_list.

    Once confirmed to be bypassing, specific elevator and block cgroup
    policy implementations can assume that nothing is in flight for them
    and perform various operations which would be dangerous otherwise.

    Such confirmation is achieved by short-circuiting all new requests
    directly to the dispatch queue and waiting for all the requests which
    were issued before to finish. Unfortunately, while the request
    allocating and draining sides were properly handled, we forgot to
    actually plug the request dispatch path. Even after bypassing mode is
    confirmed, if the attached driver tries to fetch a request and the
    dispatch queue is empty, __elv_next_request() would invoke the current
    elevator's elevator_dispatch_fn() callback. As all in-flight requests
    were drained, the elevator wouldn't contain any request but once
    bypass is confirmed we don't even know whether the elevator is even
    there. It might be in the process of being switched and half torn
    down.

    Frank Mayhar reports that this actually happened while switching
    elevators, leading to an oops.

    Let's fix it by making __elv_next_request() avoid invoking the
    elevator_dispatch_fn() callback if the queue is bypassing. It already
    avoids invoking the callback if the queue is dying. As a dying queue
    is guaranteed to be bypassing, we can simply replace blk_queue_dying()
    check with blk_queue_bypass().
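
    A sketch of the resulting check in __elv_next_request():

    /* a bypassing queue must never call into the elevator; dying
     * queues are guaranteed to be bypassing, so this also covers
     * the old check */
    if (unlikely(blk_queue_bypass(q)) ||
        !q->elevator->type->ops.elevator_dispatch_fn(q, 0))
            return NULL;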

    Reported-by: Frank Mayhar
    References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com
    Cc: stable@vger.kernel.org
    Tested-by: Frank Mayhar

    Signed-off-by: Jens Axboe

    Tejun Heo
     
    Reserving a tag (request) for flush to avoid deadlock is overkill. A
    tag is a valuable resource. We can track the number of flush requests
    and disallow having too many pending flush requests allocated. With
    this patch, blk_mq_alloc_request_pinned() could do a busy nop (but not
    a dead loop) if too many pending requests are allocated and a new flush
    request is being allocated. But this should not be a problem; having
    too many pending flush requests is a very rare case.
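
    A hypothetical sketch of the idea (the counter and limit names here are
    invented for illustration):

    /* instead of reserving a tag for flush, cap how many flush
     * requests may be pending; the caller busy-retries allocation
     * rather than deadlocking on an exhausted tag space */
    if (is_flush && atomic_read(&q->pending_flush_cnt) >= MAX_PENDING_FLUSH)
            return NULL;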

    I verified this can fix the deadlock caused by too many pending flush
    requests.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li