14 Sep, 2016

1 commit

  • In order to help determine the effectiveness of polling in a running
    system it is useful to determine the ratio of how often the poll
    function is called vs how often the completion is checked. For this
    reason we add a poll_considered variable and add it to the sysfs entry
    for io_poll.

    Signed-off-by: Stephen Bates
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stephen Bates
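
The ratio described in this commit can be derived from two counters. The following is a minimal userspace sketch, not the kernel code: only poll_considered is named in the commit message, while the struct, the poll_invoked counter, and the helper name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-queue polling counters. Only poll_considered is named in
 * the commit message; poll_invoked is an assumed companion counter. */
struct poll_stats {
    unsigned long poll_considered; /* times a completion check was started */
    unsigned long poll_invoked;    /* times the poll function was called */
};

/* Average number of poll calls per completion check.
 * Returns 0.0 when nothing was considered, to avoid dividing by zero. */
static double polls_per_consideration(const struct poll_stats *s)
{
    assert(s != NULL);
    if (s->poll_considered == 0)
        return 0.0;
    return (double)s->poll_invoked / (double)s->poll_considered;
}
```

A high value would suggest that each completion check spins through many poll calls before finishing.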
     

29 Aug, 2016

2 commits


17 Aug, 2016

1 commit

  • blk_set_queue_dying() can be called while another thread is
    submitting I/O or changing queue flags, e.g. through dm_stop_queue().
    Hence protect the QUEUE_FLAG_DYING flag change with locking.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Cc: stable
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
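
To illustrate why assigning the field by hand breaks, here is a small sketch of a word that keeps flags in the low bits and the op code in the high bits. The bit widths, the struct, the helper names, and the member name bi_opf are assumptions for illustration; the commit message itself only says that the member is renamed.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative layout only: the bit widths are assumptions, not the kernel's. */
#define OP_BITS   3u
#define OP_SHIFT  (sizeof(unsigned int) * CHAR_BIT - OP_BITS)
#define FLAG_MASK ((1u << OP_SHIFT) - 1u)

struct sketch_bio {
    unsigned int bi_opf; /* assumed new name; code still using bi_rw no longer compiles */
};

/* Pack the op code into the high bits and the flags into the low bits. */
static void bio_set_op_flags(struct sketch_bio *bio, unsigned int op,
                             unsigned int flags)
{
    assert(op < (1u << OP_BITS));
    assert((flags & ~FLAG_MASK) == 0);
    bio->bi_opf = (op << OP_SHIFT) | flags;
}

static unsigned int bio_op_of(const struct sketch_bio *bio)
{
    return bio->bi_opf >> OP_SHIFT;
}

static unsigned int bio_flags_of(const struct sketch_bio *bio)
{
    return bio->bi_opf & FLAG_MASK;
}
```

With such a layout, old-style code that stores a bare op value such as 1 into the field would land in the flag bits and leave the op code at 0, which is the kind of silent runtime breakage the rename is meant to turn into a compile error.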
     

27 Jul, 2016

1 commit

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a follow-up to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     

21 Jul, 2016

3 commits

  • I wish the OSD code could simply use blk_rq_map_* helpers like
    everyone else, but the complex nature of deciding if we have
    DATA IN and/or DATA OUT buffers might make this impossible
    (at least for a mere human like me).

    But using blk_rq_append_bio at least allows sharing the setup code
    between requests with or without data buffers, and given that this
    is the last user of blk_make_request it allows getting rid of that
    somewhat awkward interface.

    Signed-off-by: Christoph Hellwig
    Acked-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The target SCSI passthrough backend is much better served with the low-level
    blk_rq_append_bio construct than the helpers built on top of it, so export it.

    Also use the opportunity to remove the pointless request_queue argument and
    make the code flow a little more readable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_get_request is used for BLOCK_PC and similar passthrough requests.
    Currently we always need to call blk_rq_set_block_pc or an open coded
    version of it to allow appending bios using the request mapping helpers
    later on, which is a somewhat awkward API. Instead move the
    initialization part of blk_rq_set_block_pc into blk_get_request, so that
    we always have a safe to use request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Jul, 2016

1 commit

  • The new NVMe over fabrics target will make use of this outside from a
    module.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     

10 Jun, 2016

1 commit

  • If we're queuing REQ_PRIO IO and the task is running at an idle IO
    class, then temporarily boost the priority. This prevents livelocks
    due to priority inversion, when a low priority task is holding file
    system resources while attempting to do IO.

    An example of that is shown below. An ioniced idle task is holding
    the directory mutex, while a normal priority task is trying to do
    a directory lookup.

    [478381.198925] ------------[ cut here ]------------
    [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
    [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080
    [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
    [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
    [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
    [478381.206874] Call Trace:
    [478381.207253] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.208175] [] schedule+0x37/0x90
    [478381.208932] [] schedule_timeout+0x1dc/0x250
    [478381.209805] [] ? __blk_run_queue+0x37/0x50
    [478381.210706] [] ? ktime_get+0x45/0xb0
    [478381.211489] [] io_schedule_timeout+0xa7/0x110
    [478381.212402] [] ? prepare_to_wait+0x5b/0x90
    [478381.213280] [] bit_wait_io+0x36/0x50
    [478381.214063] [] __wait_on_bit+0x65/0x90
    [478381.214961] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.215872] [] out_of_line_wait_on_bit+0x7c/0x90
    [478381.216806] [] ? wake_atomic_t_function+0x40/0x40
    [478381.217773] [] __wait_on_buffer+0x2a/0x30
    [478381.218641] [] ext4_bread+0x57/0x70
    [478381.219425] [] __ext4_read_dirblock+0x3c/0x380
    [478381.220467] [] ext4_dx_find_entry+0x7d/0x170
    [478381.221357] [] ? find_get_entry+0x1e/0xa0
    [478381.222208] [] ext4_find_entry+0x484/0x510
    [478381.223090] [] ext4_lookup+0x52/0x160
    [478381.223882] [] lookup_real+0x1d/0x60
    [478381.224675] [] __lookup_hash+0x38/0x50
    [478381.225697] [] lookup_slow+0x45/0xab
    [478381.226941] [] link_path_walk+0x7ae/0x820
    [478381.227880] [] path_init+0xc2/0x430
    [478381.228677] [] ? security_file_alloc+0x16/0x20
    [478381.229776] [] path_openat+0x77/0x620
    [478381.230767] [] ? page_add_file_rmap+0x2e/0x70
    [478381.232019] [] do_filp_open+0x43/0xa0
    [478381.233016] [] ? creds_are_invalid+0x29/0x70
    [478381.234072] [] do_open_execat+0x70/0x170
    [478381.235039] [] do_execveat_common.isra.36+0x1b8/0x6e0
    [478381.236051] [] do_execve+0x2c/0x30
    [478381.236809] [] ? getname+0x12/0x20
    [478381.237564] [] SyS_execve+0x2e/0x40
    [478381.238338] [] stub_execve+0x6d/0xa0
    [478381.239126] ------------[ cut here ]------------
    [478381.239915] ------------[ cut here ]------------
    [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
    [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080
    [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
    [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
    [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
    [478381.248252] Call Trace:
    [478381.248630] [] schedule+0x37/0x90
    [478381.249382] [] schedule_preempt_disabled+0xe/0x10
    [478381.250465] [] __mutex_lock_slowpath+0x92/0x100
    [478381.251409] [] mutex_lock+0x1b/0x2f
    [478381.252199] [] lookup_slow+0x36/0xab
    [478381.253023] [] link_path_walk+0x7ae/0x820
    [478381.253877] [] ? try_charge+0xc1/0x700
    [478381.254690] [] path_init+0xc2/0x430
    [478381.255525] [] ? security_file_alloc+0x16/0x20
    [478381.256450] [] path_openat+0x77/0x620
    [478381.257256] [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
    [478381.258390] [] ? handle_mm_fault+0x13f3/0x1720
    [478381.259309] [] do_filp_open+0x43/0xa0
    [478381.260139] [] ? __alloc_fd+0x42/0x120
    [478381.260962] [] do_sys_open+0x13c/0x230
    [478381.261779] [] ? syscall_trace_enter_phase1+0x113/0x170
    [478381.262851] [] SyS_open+0x22/0x30
    [478381.263598] [] system_call_fastpath+0x12/0x17
    [478381.264551] ------------[ cut here ]------------
    [478381.265377] ------------[ cut here ]------------

    Signed-off-by: Jens Axboe
    Reviewed-by: Jeff Moyer

    Jens Axboe
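
The boosting rule from this commit can be sketched as a pure function. This is a simplified userspace model: the enum, the function name, and the choice of best-effort as the boost target are assumptions, since the commit message only says the priority is temporarily boosted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the scheduler's I/O classes. */
enum io_class { IO_CLASS_RT, IO_CLASS_BE, IO_CLASS_IDLE };

/* Class to queue a request under: idle-class requests marked as priority
 * I/O (REQ_PRIO in the commit message) are lifted to best-effort, so a task
 * holding file system resources is not starved while others wait on it. */
static enum io_class effective_io_class(enum io_class task_class, bool req_prio)
{
    assert(task_class <= IO_CLASS_IDLE);
    if (req_prio && task_class == IO_CLASS_IDLE)
        return IO_CLASS_BE; /* assumed boost target */
    return task_class;
}
```

Requests without the priority marker, and tasks in other classes, keep their original class.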
     

09 Jun, 2016

1 commit


08 Jun, 2016

10 commits


14 Apr, 2016

1 commit


13 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
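
As an illustration of what code looks like after the conversion, here is a small userspace sketch that uses only the PAGE_* names. The 4 KiB page size and the helper names are assumptions for the example; in the kernel the constants come from the architecture headers.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative values: a 4 KiB page. Real kernels take these from the arch. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Page index of a file position. Before the patch this kind of code would
 * have shifted by PAGE_CACHE_SHIFT, which always equalled PAGE_SHIFT. */
static unsigned long page_index(unsigned long long pos)
{
    return (unsigned long)(pos >> PAGE_SHIFT);
}

/* Byte offset of a file position inside its page. */
static unsigned long offset_within_page(unsigned long long pos)
{
    return (unsigned long)(pos & ~PAGE_MASK);
}

/* Round a length up to the next page boundary (was PAGE_CACHE_ALIGN). */
static unsigned long page_align(unsigned long len)
{
    assert(len <= ULONG_MAX - (PAGE_SIZE - 1)); /* rounding must not overflow */
    return (len + PAGE_SIZE - 1) & PAGE_MASK;
}
```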
     

19 Mar, 2016

1 commit

  • Pull libata updates from Tejun Heo:

    - ahci grew runtime power management support so that the controller can
    be turned off if no devices are attached.

    - sata_via isn't dead yet. It got hotplug support and more refined
    workaround for certain WD drives.

    - Misc cleanups. There's a merge from for-4.5-fixes to avoid confusing
    conflicts in ahci PCI ID table.

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
    ata: ahci_xgene: dereferencing uninitialized pointer in probe
    AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
    ata: sata_rcar: Use ARCH_RENESAS
    sata_via: Implement hotplug for VT6421
    sata_via: Apply WD workaround only when needed on VT6421
    ahci: Add runtime PM support for the host controller
    ahci: Add functions to manage runtime PM of AHCI ports
    ahci: Convert driver to use modern PM hooks
    ahci: Cache host controller version
    scsi: Drop runtime PM usage count after host is added
    scsi: Set request queue runtime PM status back to active on resume
    block: Add blk_set_runtime_active()
    ata: ahci_mvebu: add support for Armada 3700 variant
    libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
    libata: support AHCI on OCTEON platform

    Linus Torvalds
     

23 Feb, 2016

1 commit

  • Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
    than if an underlying null_blk device were used directly. One of the
    reasons for this drop in performance is that blk_insert_clone_request()
    was calling blk_mq_insert_request() with @async=true. This forced the
    use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
    which ushered in ping-ponging between process context (fio in this case)
    and kblockd's kworker to submit the cloned request. The ftrace
    function_graph tracer showed:

    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...
    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...

    Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
    _not_ use kblockd to submit the cloned requests isn't enough to
    eliminate the observed context switches.

    In addition to this dm-mq specific blk-core fix, there are 2 DM core
    fixes to dm-mq that (when paired with the blk-core fix) completely
    eliminate the observed context switching:

    1) don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this). So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e16ca2d!

    2) use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request). Using blk_mq_complete_request() for
    blk-mq requests is important for performance. It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

    dm-mq fix #2 is _much_ more important than #1 for eliminating the
    context switches.
    Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
    After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

    With these changes multithreaded async read IOPs improved from ~950K
    to ~1350K for this dm-mq stacked on null_blk test-case. The raw read
    IOPs of the underlying null_blk device for the same workload is ~1950K.

    Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
    Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
    Cc: stable@vger.kernel.org # 4.1+
    Reported-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe

    Mike Snitzer
     

19 Feb, 2016

1 commit

  • If a block device is left runtime suspended during system suspend, the
    resume hook of the driver typically corrects the runtime PM status of the
    device back to "active" after it is resumed. However, this is not enough,
    as the queue's runtime PM status is still "suspended". As long as it is in
    this state, blk_pm_peek_request() returns NULL and thus prevents new
    requests from being processed.

    Add a new function, blk_set_runtime_active(), that can be used to force
    the queue status back to "active" as needed.

    Signed-off-by: Mika Westerberg
    Acked-by: Jens Axboe
    Signed-off-by: Tejun Heo

    Mika Westerberg
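
The stall described in this commit can be modelled in a few lines of userspace C. This is a simplified sketch under stated assumptions: the struct, the enum, and the helper names are hypothetical stand-ins, not the kernel's blk_pm_peek_request() or blk_set_runtime_active() implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a queue's runtime PM state. */
enum rpm_status { RPM_ACTIVE, RPM_SUSPENDED };

struct rpm_queue {
    enum rpm_status rpm_status;
    int pending; /* number of queued requests */
};

/* Stand-in for the peek path: nothing is handed out while the queue is
 * still marked suspended, even if the device itself is already awake. */
static int peek_request(const struct rpm_queue *q)
{
    assert(q != NULL);
    if (q->rpm_status != RPM_ACTIVE)
        return 0;
    return q->pending > 0;
}

/* Stand-in for the new helper: force the queue status back to active. */
static void set_runtime_active(struct rpm_queue *q)
{
    assert(q != NULL);
    q->rpm_status = RPM_ACTIVE;
}
```

In this model, a driver's resume path would call set_runtime_active() so that pending requests become visible again.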
     

05 Feb, 2016

2 commits

  • James Bottomley
     
  • When a storage device rejects a WRITE SAME command we will disable write
    same functionality for the device and return -EREMOTEIO to the block
    layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
    failing the path.

    Yiwen Jiang discovered a small race where WRITE SAME requests issued
    simultaneously would cause -EIO to be returned. This happened because
    any requests being prepared after WRITE SAME had been disabled for the
    device caused us to return BLKPREP_KILL. The latter caused the block
    layer to return -EIO upon completion.

    To overcome this we introduce BLKPREP_INVALID which indicates that this
    is an invalid request for the device. blk_peek_request() is modified to
    return -EREMOTEIO in that case.

    Reported-by: Yiwen Jiang
    Suggested-by: Mike Snitzer
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ewan Milne
    Reviewed-by: Yiwen Jiang
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
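
The difference between the two prep results can be sketched as a mapping to completion error codes. The enum and function name below are hypothetical; only the pairing of BLKPREP_KILL with -EIO and BLKPREP_INVALID with -EREMOTEIO is taken from the commit message.

```c
#include <assert.h>
#include <errno.h>

#ifndef EREMOTEIO
#define EREMOTEIO 121 /* Linux value; fallback for platforms that lack it */
#endif

/* Hypothetical mirror of the prep results discussed in the commit. */
enum prep_result { PREP_OK, PREP_KILL, PREP_INVALID };

/* Error code a rejected request completes with; 0 means it proceeds. */
static int prep_result_to_errno(enum prep_result r)
{
    switch (r) {
    case PREP_OK:
        return 0;
    case PREP_KILL:
        return -EIO;       /* generic failure: DM may retry or fail the path */
    case PREP_INVALID:
        return -EREMOTEIO; /* request is invalid for this device */
    }
    assert(!"unknown prep result");
    return -EIO;
}
```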
     

22 Jan, 2016

1 commit

  • Pull NVMe updates from Jens Axboe:
    "Last branch for this series is the nvme changes. It's in a separate
    branch to avoid splitting too much between core and NVMe changes,
    since NVMe is still helping drive some blk-mq changes. That said, not
    a huge amount of core changes in here. The grunt of the work is the
    continued split of the code"

    * 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits)
    uapi: update install list after nvme.h rename
    NVMe: Export NVMe attributes to sysfs group
    NVMe: Shutdown controller only for power-off
    NVMe: IO queue deletion re-write
    NVMe: Remove queue freezing on resets
    NVMe: Use a retryable error code on reset
    NVMe: Fix admin queue ring wrap
    nvme: make SG_IO support optional
    nvme: fixes for NVME_IOCTL_IO_CMD on the char device
    nvme: synchronize access to ctrl->namespaces
    nvme: Move nvme_freeze/unfreeze_queues to nvme core
    PCI/AER: include header file
    NVMe: Export namespace attributes to sysfs
    NVMe: Add pci error handlers
    block: remove REQ_NO_TIMEOUT flag
    nvme: merge iod and cmd_info
    nvme: meta_sg doesn't have to be an array
    nvme: properly free resources for cancelled command
    nvme: simplify completion handling
    nvme: special case AEN requests
    ...

    Linus Torvalds
     

20 Jan, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "We don't have a lot of core changes this time around, it's mostly in
    drivers, which will come in a subsequent pull.

    The core changes include:

    - blk-mq
    - Prep patch from Christoph, changing blk_mq_alloc_request() to
    take flags instead of just using gfp_t for sleep/nosleep.
    - Doc patch from me, clarifying the difference between legacy
    and blk-mq for timer usage.
    - Fixes from Raghavendra for memory-less numa nodes, and a reuse
    of CPU masks.

    - Cleanup from Geliang Tang, using offset_in_page() instead of open
    coding it.

    - From Ilya, rename request_queue slab so it reflects what it holds,
    and a fix for proper use of bdgrab/put.

    - A real fix for the split across stripe boundaries from Keith. We
    yanked a broken version of this from 4.4-rc final, this one works.

    - From Mike Krinkin, emit a trace message when we split.

    - From Wei Tang, two small cleanups, not explicitly clearing memory
    that is already cleared"

    * 'for-4.5/core' of git://git.kernel.dk/linux-block:
    block: use bd{grab,put}() instead of open-coding
    block: split bios to max possible length
    block: add call to split trace point
    blk-mq: Avoid memoryless numa node encoded in hctx numa_node
    blk-mq: Reuse hardware context cpumask for tags
    blk-mq: add a flags parameter to blk_mq_alloc_request
    Revert "blk-flush: Queue through IO scheduler when flush not required"
    block: clarify blk_add_timer() use case for blk-mq
    bio: use offset_in_page macro
    block: do not initialise statics to 0 or NULL
    block: do not initialise globals to 0 or NULL
    block: rename request_queue slab cache

    Linus Torvalds
     

29 Dec, 2015

1 commit


23 Dec, 2015

2 commits

  • blk_queue_bio() does split then bounce, which makes the segment
    counting based on pages before bouncing and could go wrong. Move
    the split to after bouncing, like we do for blk-mq, and then we
    fix the issue of having the bio count for segments be wrong.

    Fixes: 54efd50bfd87 ("block: make generic_make_request handle arbitrarily sized bios")
    Cc: stable@vger.kernel.org
    Tested-by: Artem S. Tashkinov
    Signed-off-by: Jens Axboe

    Junichi Nomura
     
  • Timer context is not very useful for drivers to perform any meaningful abort
    action from. So instead of calling the driver from this useless context
    defer it to a workqueue as soon as possible.

    Note that while a delayed_work item would seem the right thing here I didn't
    dare to use it due to the magic in blk_add_timer that pokes deep into timer
    internals. But maybe this encourages Tejun to add a sensible API for that to
    the workqueue API and we'll all be fine in the end :)

    Contains a major update from Keith Busch:

    "This patch removes synchronizing the timeout work so that the timer can
    start a freeze on its own queue. The timer enters the queue, so timer
    context can only start a freeze, but not wait for frozen."

    Signed-off-by: Christoph Hellwig
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Dec, 2015

1 commit

  • The routines in scsi_pm.c assume that if a runtime-PM callback is
    invoked for a SCSI device, it can only mean that the device's driver
    has asked the block layer to handle the runtime power management (by
    calling blk_pm_runtime_init(), which among other things sets q->dev).

    However, this assumption turns out to be wrong for things like the ses
    driver. Normally ses devices are not allowed to do runtime PM, but
    userspace can override this setting. If this happens, the kernel gets
    a NULL pointer dereference when blk_post_runtime_resume() tries to use
    the uninitialized q->dev pointer.

    This patch fixes the problem by checking q->dev in the block layer before
    handling runtime PM. Since ses doesn't define any PM callbacks or call
    blk_pm_runtime_init(), the crash won't occur.

    This fixes Bugzilla #101371.
    https://bugzilla.kernel.org/show_bug.cgi?id=101371

    More discussion can be found at the link below.
    http://marc.info/?l=linux-scsi&m=144163730531875&w=2

    Signed-off-by: Ken Xue
    Acked-by: Alan Stern
    Cc: Xiangliang Yu
    Cc: James E.J. Bottomley
    Cc: Jens Axboe
    Cc: Michael Terry
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Ken Xue
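
The guard described in this commit amounts to an early return when the queue never opted in to block-layer runtime PM. A minimal userspace sketch, with hypothetical struct and function names standing in for the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device and request queue. */
struct pm_device { int resumes; };

struct pm_queue {
    struct pm_device *dev; /* stays NULL unless runtime PM was initialised */
};

/* Returns 1 if runtime PM bookkeeping ran, 0 if it was skipped. */
static int post_runtime_resume(struct pm_queue *q)
{
    assert(q != NULL);
    if (q->dev == NULL) /* queue never set up runtime PM: nothing to do */
        return 0;
    q->dev->resumes++;  /* safe: dev is known to be valid here */
    return 1;
}
```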
     

02 Dec, 2015

1 commit


30 Nov, 2015

1 commit

  • When a cloned request is retried on other queues it always needs
    to be checked against the queue limits of that queue.
    Otherwise the calculations for nr_phys_segments might be wrong,
    leading to a crash in scsi_init_sgtable().

    To clarify this the patch renames blk_rq_check_limits()
    to blk_cloned_rq_check_limits() and removes the symbol
    export, as the new function should only be used for
    cloned requests and never exported.

    Cc: Mike Snitzer
    Cc: Ewan Milne
    Cc: Jeff Moyer
    Signed-off-by: Hannes Reinecke
    Fixes: e2a60da74 ("block: Clean up special command handling logic")
    Cc: stable@vger.kernel.org # 3.7+
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Hannes Reinecke
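
The check on a cloned request can be sketched as a comparison against the limits of the queue it is being retried on. This is a simplified model: the structs, field names, and the -EIO return value are assumptions for illustration, not the body of blk_cloned_rq_check_limits().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical subset of a queue's limits. */
struct limits_sketch {
    unsigned int max_sectors;
    unsigned short max_segments;
};

/* Hypothetical subset of a request's size information. */
struct request_sketch {
    unsigned int nr_sectors;
    unsigned short nr_phys_segments;
};

/* Returns 0 if the cloned request fits the target queue, -EIO otherwise.
 * The caller is expected to run this each time the clone moves to a queue. */
static int cloned_rq_fits(const struct limits_sketch *lim,
                          const struct request_sketch *rq)
{
    assert(lim != NULL && rq != NULL);
    if (rq->nr_sectors > lim->max_sectors)
        return -EIO;
    if (rq->nr_phys_segments > lim->max_segments)
        return -EIO;
    return 0;
}
```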
     

25 Nov, 2015

2 commits