20 Mar, 2018

7 commits

  • This mirrors the blk-mq capability to allocate extra driver-specific
    data behind struct request by setting a cmd_size field, as well as having
    a constructor / destructor for it.
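
    As an illustration (not part of the patch itself), a hedged sketch of how
    a legacy driver might use these fields; my_cmd, my_init_rq and my_exit_rq
    are hypothetical, while cmd_size / init_rq_fn / exit_rq_fn follow the
    commit:

        /* driver-private data placed directly behind struct request */
        struct my_cmd {
                void *sense_buffer;
        };

        static int my_init_rq(struct request_queue *q, struct request *rq,
                              gfp_t gfp)
        {
                struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);

                cmd->sense_buffer = kzalloc(96, gfp);
                return cmd->sense_buffer ? 0 : -ENOMEM;
        }

        static void my_exit_rq(struct request_queue *q, struct request *rq)
        {
                struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);

                kfree(cmd->sense_buffer);
        }

        /* set up before initializing the queue */
        q->cmd_size = sizeof(struct my_cmd);
        q->init_rq_fn = my_init_rq;
        q->exit_rq_fn = my_exit_rq;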

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit 6d247d7f71d1fa4b66a5f4da7b1daa21510d529b)

    Christoph Hellwig
     
  • Return an errno value instead of the passed-in queue so that the callers
    don't have to keep track of two queues, and move the assignment of the
    request_fn and lock to the caller, as passing them as arguments doesn't
    simplify anything. While we're at it, also remove two pointless NULL
    assignments, given that the request structure is zeroed on allocation.
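
    A minimal sketch of the new calling convention described above, assuming
    the post-change signature int blk_init_allocated_queue(struct
    request_queue *q); my_request_fn, my_driver_lock and out_free_queue are
    hypothetical:

        q->request_fn = my_request_fn;  /* caller assigns these itself now */
        q->queue_lock = &my_driver_lock;

        err = blk_init_allocated_queue(q);
        if (err)                        /* an errno, not a NULL queue */
                goto out_free_queue;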

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit 5ea708d15a928f7a479987704203616d3274c03b)

    Conflicts:
    block/blk-core.c

    Christoph Hellwig
     
  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later), and thus cleaning up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
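
    A small user-space demo of the encoding described above; REQ_OP_BITS = 8
    matches the commit, while the op and flag values here are illustrative
    only:

        #include <stdio.h>

        #define REQ_OP_BITS 8
        #define REQ_OP_MASK ((1u << REQ_OP_BITS) - 1)

        enum { OP_READ = 0, OP_WRITE = 1 };              /* illustrative */
        #define FLAG_SYNC (1u << (REQ_OP_BITS + 0))      /* illustrative */

        int main(void)
        {
                unsigned int opf = OP_WRITE | FLAG_SYNC; /* one 32-bit value */

                /* the operation sits in the low bits, no shifting around */
                printf("op=%u sync=%u\n", opf & REQ_OP_MASK,
                       !!(opf & FLAG_SYNC));             /* op=1 sync=1 */
                return 0;
        }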

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef295ecf090d3e86e5b742fc6ab34f1122a43773)

    Conflicts:
    block/blk-mq.c
    include/linux/blk_types.h
    include/linux/blkdev.h

    Christoph Hellwig
     
  • A lot of the REQ_* flags are only used on struct requests, and only of
    use to the block layer and a few drivers that dig into struct request
    internals.

    This patch adds a new req_flags_t rq_flags field to struct request for
    them, and thus dramatically shrinks the set of common request flags. It
    also removes the unfortunate situation where we have to fit the fields
    from the same enum into 32 bits for struct bio and 64 bits for
    struct request.
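
    A compile-only sketch of the resulting layout; the field names follow the
    commit message, while the struct name and the flag are illustrative:

        typedef unsigned int req_flags_t;

        #define RQF_EXAMPLE ((req_flags_t)(1 << 0)) /* hypothetical flag */

        struct request_sketch {
                unsigned int cmd_flags; /* op + common REQ_* flags; same
                                           32-bit encoding as bio->bi_opf */
                req_flags_t rq_flags;   /* request-only flags live here */
        };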

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Shaun Tancheff
    Signed-off-by: Jens Axboe
    (cherry picked from commit e806402130c9c494e22c73ae9ead4e79d2a5811c)

    Conflicts:
    drivers/mmc/core/block.c
    drivers/scsi/sd_zbc.c

    Christoph Hellwig
     
  • It's the last bio-only REQ_* flag, and we have space for it in the bio
    bi_flags field.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Shaun Tancheff
    Signed-off-by: Jens Axboe
    (cherry picked from commit 8d2bbd4c8236e9e38e6b36ac9e2c54fdcfe5b335)

    Christoph Hellwig
     
  • With the addition of the zoned operations, the tests in this function
    became incorrect. But I think it's much better to just open code the
    allowed operations in the only caller anyway.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Shaun Tancheff
    Signed-off-by: Jens Axboe
    (cherry picked from commit c4aebd0332da831a3403faf2035af45059ab6b7c)

    Christoph Hellwig
     
  • errata:
    When a read command returns less data than specified in the PRDs (for
    example, there are two PRDs for this command, but the device returns a
    number of bytes which is less than in the first PRD), the second PRD of
    this command is not read out of the PRD FIFO, causing the next command
    to use this PRD erroneously.

    workaround:
    - force sg_tablesize = 1
    - modify the sg_io function in block/scsi_ioctl.c to use a 64k buffer
    allocated with dma_alloc_coherent during the probe in ahci_imx (see the
    sketch below)
    - to fix the SCSI/SATA hang when CD_ROM and HDD are accessed
    simultaneously after the workaround is applied, do not go to sleep
    in scsi_eh_handler when the host has failed
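
    A hedged sketch of the probe-time bounce-buffer allocation mentioned
    above; the device pointer and error handling are illustrative:

        void *bounce;
        dma_addr_t bounce_dma;

        bounce = dma_alloc_coherent(&pdev->dev, SZ_64K, &bounce_dma,
                                    GFP_KERNEL);
        if (!bounce)
                return -ENOMEM;
        /* sg_io() then bounces data through this single 64k buffer,
         * which is why sg_tablesize is forced to 1 */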

    Signed-off-by: Richard Zhu

    Richard Zhu
     

25 Feb, 2018

1 commit

  • commit 69e0927b3774563c19b5fb32e91d75edc147fb62 upstream.

    During stress tests by syzkaller on the sg driver, the block layer
    infrequently returns EINVAL. Closer inspection shows that the block
    layer was trying to return ENOMEM (which is much more
    understandable) but for some reason overrode that useful error.

    The patch below does not show this (unchanged) line:
    ret = __blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy);
    That 'ret' was being overridden when that function failed.
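
    In outline, the fix keeps the callee's error code on the cleanup path
    (a sketch with simplified label placement):

        ret = __blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy);
        if (ret)
                goto fail;      /* e.g. -ENOMEM from the mapping */
        /* ... */
        fail:
                rq->bio = NULL;
                return ret;     /* previously a hard-coded -EINVAL */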

    Signed-off-by: Douglas Gilbert
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Douglas Gilbert
     

20 Dec, 2017

2 commits

  • [ Upstream commit 39b4954c0a1556f8f7f1fdcf59a227117fcd8a0b ]

    MD's rdev_set_badblocks() expects that badblocks_set() returns 1 if
    badblocks are disabled; otherwise rdev_set_badblocks() will record
    superblock changes and return success in that case, and md will fail
    to report an IO error which it should.

    This bug has existed since badblocks were introduced in commit
    9e0e252a048b ("badblocks: Add core badblock management code").

    Signed-off-by: Liu Bo
    Acked-by: Guoqing Jiang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 0067d4b020ea07a58540acb2c5fcd3364bf326e0 ]

    In case a CPU was unplugged, we need to make sure not to assume
    that the tags for that CPU are still allocated. So check
    for null tags when reinitializing a tagset.
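
    The guard is essentially a null check per hardware queue (a sketch based
    on the upstream patch):

        for (i = 0; i < set->nr_hw_queues; i++) {
                struct blk_mq_tags *tags = set->tags[i];

                if (!tags)      /* CPU was unplugged; tags were freed */
                        continue;
                /* ... reinitialize the requests backing these tags ... */
        }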

    Reported-by: Yi Zhang
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

14 Dec, 2017

2 commits

  • [ Upstream commit 34d9715ac1edd50285168dd8d80c972739a4f6a4 ]

    Once blk_set_queue_dying() is done in blk_cleanup_queue(), we call
    blk_freeze_queue() and wait for q->q_usage_counter to become zero. But
    if there are tasks blocked in get_request(), q->q_usage_counter can
    never become zero. So we have to wake up all these tasks in
    blk_set_queue_dying() first.
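
    A sketch of the wakeup added to blk_set_queue_dying() for the legacy
    request path, modeled on the upstream patch; wake_up_all() makes every
    sleeper in get_request() recheck the dying flag:

        if (q->request_fn) {
                struct request_list *rl;

                blk_queue_for_each_rl(rl, q) {
                        if (rl->rq_pool) {
                                wake_up_all(&rl->wait[BLK_RW_SYNC]);
                                wake_up_all(&rl->wait[BLK_RW_ASYNC]);
                        }
                }
        }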

    Fixes: 3ef28e83ab157997 ("block: generic request_queue reference counting")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 737f98cfe7de8df7433a4d846850aa8efa44bd48 ]

  • Both q->mq_kobj and the sw queues' kobjects should have been initialized
    only once, instead of being initialized in each add_disk context.

    Also this patch removes the clearing of ctx in blk_mq_init_cpu_queues(),
    because the percpu allocator zeroes the allocated memory.

    This patch fixes one issue[1] reported from Omar.

    [1] kernel warning when doing unbind/bind on one scsi-mq device

    [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
    [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
    [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
    [ 19.350920] Workqueue: events_unbound async_run_entry_fn
    [ 19.350920] Call Trace:
    [ 19.350920] dump_stack+0x63/0x83
    [ 19.350920] kobject_init+0x77/0x90
    [ 19.350920] blk_mq_register_dev+0x40/0x130
    [ 19.350920] blk_register_queue+0xb6/0x190
    [ 19.350920] device_add_disk+0x1ec/0x4b0
    [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod]
    [ 19.350920] async_run_entry_fn+0x48/0x150
    [ 19.350920] process_one_work+0x1d0/0x480
    [ 19.350920] worker_thread+0x48/0x4e0
    [ 19.350920] kthread+0x101/0x140
    [ 19.350920] ? process_one_work+0x480/0x480
    [ 19.350920] ? kthread_create_on_node+0x60/0x60
    [ 19.350920] ret_from_fork+0x2c/0x40

    Cc: Omar Sandoval
    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

30 Nov, 2017

1 commit

  • commit 4e9b6f20828ac880dbc1fa2fdbafae779473d1af upstream.

    Make sure that if the timeout timer fires after a queue has been
    marked "dying" that the affected requests are finished.

    Reported-by: chenxiang (M)
    Fixes: commit 287922eb0b18 ("block: defer timeouts to a workqueue")
    Signed-off-by: Bart Van Assche
    Tested-by: chenxiang (M)
    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

21 Oct, 2017

1 commit

  • This reverts commit eb4375e1969c48d454998b2a284c2e6a5dc9eb68 which was
    commit f507b54dccfd8000c517d740bc45f20c74532d18 upstream.

    Ben reports:
    That function doesn't exist here (it was introduced in 4.13).
    Instead, this backport has modified bsg_create_job(), creating a
    leak. Please revert this on the 3.18, 4.4 and 4.9 stable
    branches.

    So I'm dropping it from here.

    Reported-by: Ben Hutchings
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

18 Oct, 2017

3 commits

  • commit 1cfd0ddd82232804e03f3023f6a58b50dfef0574 upstream.

    Since "block: support large requests in blk_rq_map_user_iov" we
    started to call it with partially drained iter; that works fine
    on the write side, but reads create a copy of iter for completion
    time. And that needs to take the possibility of ->iov_iter != 0
    into account...

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 2b04e8f6bbb196cab4b232af0f8d48ff2c7a8058 upstream.

    We need to take care of the failure exit as well - pages already
    in the bio should be dropped by an analogue of bio_unmap_pages(),
    since their refcounts had been bumped only once per reference
    in the bio.
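
    The failure exit then wants roughly the following (a sketch using the
    bvec iterator of that era):

        out_unmap:
                bio_for_each_segment_all(bvec, bio, i)
                        put_page(bvec->bv_page); /* one ref per reference
                                                    taken into the bio */
                bio_put(bio);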

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 95d78c28b5a85bacbc29b8dba7c04babb9b0d467 upstream.

    bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
    IO vector has small consecutive buffers belonging to the same page.
    bio_add_pc_page merges them into one, but the page reference is never
    dropped.

    Signed-off-by: Vitaly Mayatskikh
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Mayatskikh
     

08 Oct, 2017

1 commit

  • [ Upstream commit c5082b70adfe8e1ea1cf4a8eff92c9f260e364d2 ]

    If a GUID Partition Table claims to have more than 2**25 entries, the
    calculation of the partition table size in alloc_read_gpt_entries() will
    overflow a 32-bit integer and not enough space will be allocated for the
    table.

    Nothing seems to get written out of bounds, but later efi_partition() will
    read up to 32768 bytes from a 128 byte buffer, possibly OOPSing or exposing
    information to /proc/partitions and uevents.

    The problem exists on both 64-bit and 32-bit platforms.

    Fix the overflow and also print a meaningful debug message if the table
    size is too large.
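
    The arithmetic is easy to reproduce in user space (sizes as in the
    message above; this demo is not the kernel code):

        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint32_t nr  = 1u << 25; /* claimed GPT entry count */
                uint32_t esz = 128;      /* bytes per partition entry */

                uint32_t bad  = nr * esz;           /* wraps to 0 */
                uint64_t good = (uint64_t)nr * esz; /* 4294967296 */

                printf("32-bit: %" PRIu32 ", 64-bit: %" PRIu64 "\n",
                       bad, good);
                return 0;
        }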

    Signed-off-by: Alden Tondettar
    Acked-by: Ard Biesheuvel
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alden Tondettar
     

05 Oct, 2017

1 commit

  • commit f507b54dccfd8000c517d740bc45f20c74532d18 upstream.

    The job structure is allocated as part of the request, so we should not
    free it in the error path of bsg_prepare_job.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

27 Sep, 2017

1 commit

  • commit 4ddd56b003f251091a67c15ae3fe4a5c5c5e390a upstream.

  • Calling blk_start_queue() from interrupt context with the queue
    lock held and without disabling IRQs, as the skd driver does, is
    safe. This patch avoids triggering the following warning when the
    skd driver is loaded:

    WARNING: CPU: 11 PID: 1348 at block/blk-core.c:283 blk_start_queue+0x84/0xa0
    RIP: 0010:blk_start_queue+0x84/0xa0
    Call Trace:
    skd_unquiesce_dev+0x12a/0x1d0 [skd]
    skd_complete_internal+0x1e7/0x5a0 [skd]
    skd_complete_other+0xc2/0xd0 [skd]
    skd_isr_completion_posted.isra.30+0x2a5/0x470 [skd]
    skd_isr+0x14f/0x180 [skd]
    irq_forced_thread_fn+0x2a/0x70
    irq_thread+0x144/0x1a0
    kthread+0x125/0x140
    ret_from_fork+0x2a/0x40

    Fixes: commit a038e2536472 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning")
    Signed-off-by: Bart Van Assche
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

25 Aug, 2017

1 commit

  • commit c005390374957baacbc38eef96ea360559510aa7 upstream.

    While pci_irq_get_affinity should never fail for SMP kernels that
    implement the affinity mapping, it will always return NULL in the
    UP case, so provide a fallback mapping of all queues to CPU 0 in
    that case.
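
    In outline (a sketch modeled on the upstream patch to
    blk_mq_pci_map_queues()):

        for (queue = 0; queue < set->nr_hw_queues; queue++) {
                const struct cpumask *mask;

                mask = pci_irq_get_affinity(pdev, queue);
                if (!mask)              /* UP build: no affinity masks */
                        goto fallback;
                /* ... map the CPUs in @mask to @queue ... */
        }
        return 0;

        fallback:
                for_each_possible_cpu(cpu)
                        set->mq_map[cpu] = 0;   /* everything on queue 0 */
                return 0;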

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

17 Jun, 2017

1 commit

  • commit 223220356d5ebc05ead9a8d697abb0c0a906fc81 upstream.

    The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD
    and NetBSD partitions and does a reasonable job picking out OpenBSD
    and NetBSD UFS subpartitions.

    But for FreeBSD the subpartitions are always "bad".

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Richard
     

14 Jun, 2017

1 commit

  • commit 5be6b75610cefd1e21b98a218211922c2feb6e08 upstream.

    When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
    as the delay of cfq_group's vdisktime if there have been other cfq_groups
    already.

    When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
    from jiffies to nanoseconds") could result in a large iops delay and
    lead to an abnormal io schedule delay for the added cfq_group. To fix
    it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
    when iops mode is enabled.

    Despite having the same value, the delay of a cfq_queue in idle class
    and the delay of cfq_group are different things, so I define two new
    macros for the delay of a cfq_group under time-slice mode and iops mode.

    Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds")
    Signed-off-by: Hou Tao
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     

20 May, 2017

1 commit

  • commit 2859323e35ab5fc42f351fbda23ab544eaa85945 upstream.

    When registering an integrity profile: if the template's interval_exp is
    not 0 use it, otherwise use the ilog2() of the logical block size of the
    provided gendisk.
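
    The registration logic then reduces to essentially one line (a sketch
    matching the description above):

        bi->interval_exp = template->interval_exp ? :
                ilog2(queue_logical_block_size(disk->queue));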

    This fixes a long-standing DM linear target bug where it cannot pass
    integrity data to the underlying device if its logical block size
    conflicts with the underlying device's logical block size.

    Reported-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

14 May, 2017

1 commit

  • commit 19b7ccf8651df09d274671b53039c672a52ad84d upstream.

    Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    introduced blk_integrity_revalidate(), which seems to assume ownership
    of the stable pages flag and unilaterally clears it if no blk_integrity
    profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

    It's called from revalidate_disk() and rescan_partitions(), making it
    impossible to enable stable pages for drivers that support partitions
    and don't use blk_integrity: while the call in revalidate_disk() can be
    trivially worked around (see zram, which doesn't support partitions and
    hence gets away with zram_revalidate_disk()), rescan_partitions() can
    be triggered from userspace at any time. This breaks rbd, where the
    ceph messenger is responsible for generating/verifying CRCs.

    Since blk_integrity_{un,}register() "must" be used for (un)registering
    the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
    setting there. This way drivers that call blk_integrity_register() and
    use integrity infrastructure won't interfere with drivers that don't
    but still want stable pages.
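
    A sketch of the end state; in this backport the bdi is embedded in the
    queue (see the note below), hence the '.' dereference:

        /* blk_integrity_register() */
        disk->queue->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;

        /* blk_integrity_unregister() */
        disk->queue->backing_dev_info.capabilities &= ~BDI_CAP_STABLE_WRITES;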

    Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Tested-by: Dan Williams
    Signed-off-by: Ilya Dryomov
    [idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue]
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

18 Apr, 2017

1 commit

  • commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 upstream.

    While stressing memory and IO and changing SMT settings at the same
    time, we were able to consistently trigger deadlocks in the mm system,
    which froze the entire machine.

    I think that under memory stress conditions, the large allocations
    performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
    waiting on the block layer remapping completion, thus deadlocking the
    system. The trace below was collected after the machine stalled,
    waiting for the hotplug event completion.

    The simplest fix for this is to make allocations in this path
    non-reclaimable, with GFP_NOIO. With this patch, we couldn't hit the
    issue anymore.
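
    Conceptually the change is just the allocation flags (a sketch; the
    actual patch touches several allocation sites in blk_mq_init_rq_map):

        /* before: reclaim may recurse back into the block layer */
        p = kzalloc_node(size, GFP_KERNEL, node);

        /* after: GFP_NOIO forbids starting I/O from reclaim here */
        p = kzalloc_node(size, GFP_NOIO, node);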

    This should apply on top of Jens's for-next branch cleanly.

    Changes since v1:
    - Use GFP_NOIO instead of GFP_NOWAIT.

    Call Trace:
    [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
    [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
    [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
    [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
    [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
    [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
    [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
    [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
    [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
    [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
    [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
    [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
    [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
    [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
    [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
    [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
    [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
    [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
    [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
    [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
    [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
    [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
    [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
    [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
    [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
    [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
    [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
    [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
    [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
    [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
    [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
    [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
    [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
    [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
    [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
    [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
    [c000000f0160be30] [c000000000009204] system_call+0x38/0xec

    Signed-off-by: Gabriel Krisman Bertazi
    Cc: Brian King
    Cc: Douglas Miller
    Cc: linux-block@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sumit Semwal
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     

08 Apr, 2017

2 commits

  • commit f5fe1b51905df7cfe4fdfd85c5fb7bc5b71a094f upstream.

    Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
    changed current->bio_list so that it did not contain *all* of the
    queued bios, but only those submitted by the currently running
    make_request_fn.

    There are two places which walk the list and requeue selected bios,
    and others that check if the list is empty. These are no longer
    correct.

    So redefine current->bio_list to point to an array of two lists, which
    contain all queued bios, and adjust various code to test or walk both
    lists.
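
    Sketched against the upstream patch, generic_make_request() now keeps
    both lists on the stack:

        struct bio_list bio_list_on_stack[2]; /* [0]: queued by the running
                                                 make_request_fn
                                                 [1]: queued before it was
                                                 invoked */

        bio_list_init(&bio_list_on_stack[0]);
        current->bio_list = bio_list_on_stack;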

    Signed-off-by: NeilBrown
    Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
    Signed-off-by: Jens Axboe
    Cc: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 79bd99596b7305ab08109a8bf44a6a4511dbf1cd upstream.

    To avoid recursion on the kernel stack when stacked block devices
    are in use, generic_make_request() will, when called recursively,
    queue new requests for later handling. They will be handled when the
    make_request_fn for the current bio completes.

    If any bios are submitted by a make_request_fn, these will ultimately
    be handled sequentially. If the handling of one of those generates
    further requests, they will be added to the end of the queue.

    This strict first-in-first-out behaviour can lead to deadlocks in
    various ways, normally because a request might need to wait for a
    previous request to the same device to complete. This can happen when
    they share a mempool, and can happen due to interdependencies
    particular to the device. Both md and dm have examples where this happens.

    These deadlocks can be eradicated by more selective ordering of bios.
    Specifically by handling them in depth-first order. That is: when the
    handling of one bio generates one or more further bios, they are
    handled immediately after the parent, before any siblings of the
    parent. That way, when generic_make_request() calls make_request_fn
    for some particular device, we can be certain that all previously
    submitted requests for that device have been completely handled and are
    not waiting for anything in the queue of requests maintained in
    generic_make_request().

    An easy way to achieve this would be to use a last-in-first-out stack
    instead of a queue. However this will change the order of consecutive
    bios submitted by a make_request_fn, which could have unexpected
    consequences. Instead we take a slightly more complex approach.
    A fresh queue is created for each call to a make_request_fn. After it
    completes, any bios for a different device are placed on the front of
    the main queue, followed by any bios for the same device, followed by
    all bios that were already on the queue before the make_request_fn was
    called. This provides the depth-first approach without reordering bios
    on the same level. A sketch of this requeue step follows.
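
    A sketch of the requeue step, modeled on the upstream patch:

        struct bio_list lower, same, hold;

        hold = bio_list_on_stack;          /* bios queued before this call */
        bio_list_init(&bio_list_on_stack); /* fresh list for this call */
        q->make_request_fn(q, bio);

        bio_list_init(&lower);
        bio_list_init(&same);
        while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
                if (q == bdev_get_queue(bio->bi_bdev))
                        bio_list_add(&same, bio);
                else
                        bio_list_add(&lower, bio);

        /* handle the lowest level first */
        bio_list_merge(&bio_list_on_stack, &lower);
        bio_list_merge(&bio_list_on_stack, &same);
        bio_list_merge(&bio_list_on_stack, &hold);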

    This, by itself, is not enough to remove all deadlocks. It just makes
    it possible for drivers to take the extra step required themselves.

    To avoid deadlocks, drivers must never risk waiting for a request
    after submitting one to generic_make_request. This includes never
    allocating from a mempool twice in one call to a make_request_fn.

    A common pattern in drivers is to call bio_split() in a loop, handling
    the first part and then looping around to possibly split the next part.
    Instead, a driver that finds it needs to split a bio should queue
    (with generic_make_request) the second part, handle the first part,
    and then return. The new code in generic_make_request will ensure the
    requests to underlying bios are processed first, then the second bio
    that was split off. If it splits again, the same process happens. In
    each case one bio will be completely handled before the next one is attempted.

    With this is place, it should be possible to disable the
    punt_bios_to_recover() recovery thread for many block devices, and
    eventually it may be possible to remove it completely.

    Ref: http://www.spinics.net/lists/raid/msg54680.html
    Tested-by: Jinpu Wang
    Inspired-by: Lars Ellenberg
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe
    Cc: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     

30 Mar, 2017

1 commit

  • commit 95a49603707d982b25d17c5b70e220a05556a2f9 upstream.

  • When iterating busy requests in the timeout handler,
    if the STARTED flag of one request isn't set, that means
    the request is being processed in the block layer or driver, and
    isn't submitted to hardware yet.

    In the current implementation of blk_mq_check_expired(),
    if the request queue becomes dying, un-started requests are
    handled as being completed/freed immediately. This is
    wrong, and can cause rq corruption or double allocation[1][2]
    when doing I/O and removing & resetting an NVMe device at the same time.

    This patch fixes several issues reported by Yi Zhang.

    [1]. oops log 1
    [ 581.789754] ------------[ cut here ]------------
    [ 581.789758] kernel BUG at block/blk-mq.c:374!
    [ 581.789760] invalid opcode: 0000 [#1] SMP
    [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
    edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
    irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
    intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
    intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
    pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
    sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
    syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
    crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
    dm_region_hash dm_log dm_mod
    [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
    [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
    [ 581.789804] Workqueue: kblockd blk_mq_timeout_work
    [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
    [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
    [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
    [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
    [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
    [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
    [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
    [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
    [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
    [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
    [ 581.789820] Call Trace:
    [ 581.789825] blk_mq_check_expired+0x76/0x80
    [ 581.789828] bt_iter+0x45/0x50
    [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
    [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
    [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
    [ 581.789840] ? __switch_to+0x140/0x450
    [ 581.789841] blk_mq_timeout_work+0x88/0x170
    [ 581.789845] process_one_work+0x165/0x410
    [ 581.789847] worker_thread+0x137/0x4c0
    [ 581.789851] kthread+0x101/0x140
    [ 581.789853] ? rescuer_thread+0x3b0/0x3b0
    [ 581.789855] ? kthread_park+0x90/0x90
    [ 581.789860] ret_from_fork+0x2c/0x40
    [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
    8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3
    0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
    [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
    [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

    [2]. oops log2
    [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
    [ 6984.857373] PGD 0
    [ 6984.857374]
    [ 6984.857376] Oops: 0000 [#1] SMP
    [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
    edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
    irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
    iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
    mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
    acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
    ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
    sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
    libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
    dm_region_hash dm_log dm_mod
    [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
    4.10.0-2.el7.bz1420297.x86_64 #1
    [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
    [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
    [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
    [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
    [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
    [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
    [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
    [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
    [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
    [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
    [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
    [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
    [ 6984.857443] Call Trace:
    [ 6984.857451] ? mempool_free+0x2b/0x80
    [ 6984.857455] ? bio_free+0x4e/0x60
    [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
    [ 6984.857462] blk_mq_process_rq_list+0x133/0x170
    [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
    [ 6984.857467] blk_mq_run_work_fn+0x12/0x20
    [ 6984.857473] process_one_work+0x165/0x410
    [ 6984.857475] worker_thread+0x137/0x4c0
    [ 6984.857478] kthread+0x101/0x140
    [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
    [ 6984.857481] ? kthread_park+0x90/0x90
    [ 6984.857489] ret_from_fork+0x2c/0x40
    [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
    89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff
    8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
    [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
    [ 6984.857512] CR2: 0000000000000010
    [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

22 Mar, 2017

1 commit

  • [ Upstream commit 25cdb64510644f3e854d502d69c73f21c6df88a9 ]

    The WRITE_SAME commands are not present in the blk_default_cmd_filter
    write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
    is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
    [ sg_io() -> blk_fill_sghdr_rq() > blk_verify_command() -> -EPERM ]

    The problem can be reproduced with the sg_write_same command

    # sg_write_same --num 1 --xferlen 512 /dev/sda
    #

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    Write same: pass through os error: Operation not permitted
    #

    For comparison, the WRITE_VERIFY command does not observe this problem,
    since it is in that list:

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
    #

    So, this patch adds the WRITE_SAME commands to the list, in order
    for the SG_IO ioctl to finish successfully:

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    #
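
    The change itself is a small addition to the default command filter (a
    sketch in the style of blk_set_cmd_filter_defaults(); the opcode macros
    come from the SCSI headers):

        __set_bit(WRITE_SAME, filter->write_ok);
        __set_bit(WRITE_SAME_16, filter->write_ok);
        __set_bit(WRITE_SAME_32, filter->write_ok);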

    That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
    (qemu "-device scsi-block" [1], libvirt "<disk device='lun'>" [2]),
    which employ the SG_IO ioctl() and run as an unprivileged user (libvirt-qemu).

    In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
    which are translated to write-same calls in the guest kernel, and then into
    SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

    [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
    [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
    [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
    [...] blk_update_request: I/O error, dev sda, sector 17096824

    Links:
    [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
    [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')

    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Brahadambal Srinivasan
    Reported-by: Manjunatha H R
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     

20 Jan, 2017

2 commits

  • commit c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3 upstream.

    Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
    wrong CPU") attempts to avoid triggering the WARN_ON in
    __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the
    last batch execution before round robin, blk_mq_hctx_next_cpu can
    schedule a dead CPU and also update next_cpu to the next alive CPU in
    the mask, which will trigger the WARN_ON despite the previous
    workaround.

    The following patch fixes this scenario by always scheduling the value
    in hctx->next_cpu. This changes the moment when we round-robin the CPU
    running the hctx, but it really doesn't matter, since it still executes
    BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.

    Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     
  • commit ebc4ff661fbe76781c6b16dfb7b754a5d5073f8e upstream.

    cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
    incorrectly hard coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.
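
    The fix is a one-word change (a sketch of cfq_cpd_alloc()):

        static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
        {
                struct cfq_group_data *cgd;

                cgd = kzalloc(sizeof(*cgd), gfp); /* was: GFP_KERNEL */
                if (!cgd)
                        return NULL;
                return &cgd->cpd;
        }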

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

09 Jan, 2017

1 commit

  • commit 128394eff343fc6d2f32172f03e24829539c5835 upstream.

    Both damn things interpret userland pointers embedded into the payload;
    worse, they are actually traversing those. Leaving aside the bad
    API design, this is very much _not_ safe to call with KERNEL_DS.
    Bail out early if that happens.
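
    The guard added at the top of each write handler (a sketch using the
    pre-4.10 address-limit idiom):

        if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
                return -EINVAL; /* refuse to parse payload pointers
                                   with KERNEL_DS */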

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

06 Jan, 2017

1 commit

  • commit bc27c01b5c46d3bfec42c96537c7a3fae0bb2cc4 upstream.

    The meaning of the BLK_MQ_S_STOPPED flag is "do not call
    .queue_rq()". Hence modify blk_mq_make_request() such that requests
    are queued instead of issued if a queue has been stopped.

    Reported-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

27 Oct, 2016

1 commit

  • If we end up sleeping due to running out of requests, we should
    update the hardware and software queues in the map ctx structure.
    Otherwise we could end up having rq->mq_ctx point to the pre-sleep
    context, and risk corrupting ctx->rq_list since we'll be
    grabbing the wrong lock when inserting the request.

    Reported-by: Dave Jones
    Reported-by: Chris Mason
    Tested-by: Chris Mason
    Fixes: 63581af3f31e ("blk-mq: remove non-blocking pass in blk_mq_map_request")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Oct, 2016

2 commits

  • When badblocks_set acknowledges a range or badblocks_clear clears a range,
    it's possible that all badblocks are acknowledged. We should update
    unacked_exist if this occurs.

    Signed-off-by: Shaohua Li
    Reviewed-by: Tomasz Majchrzak
    Tested-by: Tomasz Majchrzak
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Pull block fixes from Jens Axboe:
    "A set of fixes that missed the merge window, mostly due to me being
    away around that time.

    Nothing major here, a mix of nvme cleanups and fixes, and one fix for
    the badblocks handling"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvmet: use symbolic constants for CNS values
    nvme: use symbolic constants for CNS values
    nvme.h: add an enum for cns values
    nvme.h: don't use uuid_be
    nvme.h: resync with nvme-cli
    nvme: Add tertiary number to NVME_VS
    nvme : Add sysfs entry for NVMe CMBs when appropriate
    nvme: don't schedule multiple resets
    nvme: Delete created IO queues on reset
    nvme: Stop probing a removed device
    badblocks: fix overlapping check for clearing

    Linus Torvalds
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds