16 Apr, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "A few fixes for the current series. This contains:

    - Two fixes for NVMe:

    One fixes a reset race that can be triggered by repeated
    insert/removal of the module.

    The other fixes an issue on some platforms, where we get probe
    timeouts since legacy interrupts aren't working. This used not to
    be a problem since we had the worker thread poll for completions,
    but since that was killed off, it means those poor souls can't
    successfully probe their NVMe device. Use a proper IRQ check and
    probe (MSI-X -> MSI -> legacy), like most other drivers, to work
    around this. Both from Keith.

    - A loop corruption issue with offset in iters, from Ming Lei.

    - A fix for not having the partition stat per cpu ref count
    initialized before sending out the KOBJ_ADD, which could cause user
    space to access the counter prior to initialization. Also from
    Ming Lei.

    - A fix for using the wrong congestion state, from Kaixu Xia"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: loop: fix filesystem corruption in case of aio/dio
    NVMe: Always use MSI/MSI-x interrupts
    NVMe: Fix reset/remove race
    writeback: fix the wrong congested state variable definition
    block: partition: initialize percpuref before sending out KOBJ_ADD

    Linus Torvalds
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    That promise never materialized, and it is unlikely to.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it is a constant source of confusion whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular
    case, especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to
    the PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach.
    I'll fix them manually in a separate patch. Comments and
    documentation will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

30 Mar, 2016

1 commit

    The initialization of a partition's percpu_ref should be done before
    sending out the KOBJ_ADD uevent, which may cause userspace to read
    the partition table; otherwise the uninitialized percpu_ref may be
    accessed in the data path.

    This patch fixes this issue reported by Naveen.

    Reported-by: Naveen Kaje
    Tested-by: Naveen Kaje
    Fixes: 6c71013ecb7e2 (block: partition: convert percpu ref)
    Cc: # v4.3+
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

25 Mar, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "Final round of fixes for this merge window - some of this has come up
    after the initial pull request, and some of it was put in a post-merge
    branch before the merge window.

    This contains:

    - Fix for a bad check for an error on dma mapping in the mtip32xx
    driver, from Alexey Khoroshilov.

    - A set of fixes for lightnvm, from Javier, Matias, and Wenwei.

    - An NVMe completion record corruption fix from Marta, ensuring that
    we read things in the right order.

    - Two writeback fixes from Tejun, marked for stable@ as well.

    - A blk-mq sw queue iterator fix from Thomas, fixing an oops for
    sparse CPU maps. They hit this in the hot plug/unplug rework"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvme: avoid cqe corruption when update at the same time as read
    writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode
    writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()
    blk-mq: Use proper cpumask iterator
    mtip32xx: fix checks for dma mapping errors
    lightnvm: do not load L2P table if not supported
    lightnvm: do not reserve lun on l2p loading
    nvme: lightnvm: return ppa completion status
    lightnvm: add a bitmap of luns
    lightnvm: specify target's logical address area
    null_blk: add lightnvm null_blk device to the nullb_list

    Linus Torvalds
     

20 Mar, 2016

1 commit

  • queue_for_each_ctx() iterates over per_cpu variables under the assumption that
    the possible cpu mask cannot have holes. That's wrong as all cpumasks can have
    holes. In case there are holes the iteration ends up accessing uninitialized
    memory and crashing as a result.

    Replace the macro by a proper for_each_possible_cpu() loop and drop the unused
    macro blk_ctx_sum() which references queue_for_each_ctx().

    Reported-by: Xiong Zhou
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Thomas Gleixner
     

19 Mar, 2016

2 commits

  • Pull libata updates from Tejun Heo:

    - ahci grew runtime power management support so that the controller can
    be turned off if no devices are attached.

    - sata_via isn't dead yet. It got hotplug support and more refined
    workaround for certain WD drives.

    - Misc cleanups. There's a merge from for-4.5-fixes to avoid confusing
    conflicts in ahci PCI ID table.

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
    ata: ahci_xgene: dereferencing uninitialized pointer in probe
    AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
    ata: sata_rcar: Use ARCH_RENESAS
    sata_via: Implement hotplug for VT6421
    sata_via: Apply WD workaround only when needed on VT6421
    ahci: Add runtime PM support for the host controller
    ahci: Add functions to manage runtime PM of AHCI ports
    ahci: Convert driver to use modern PM hooks
    ahci: Cache host controller version
    scsi: Drop runtime PM usage count after host is added
    scsi: Set request queue runtime PM status back to active on resume
    block: Add blk_set_runtime_active()
    ata: ahci_mvebu: add support for Armada 3700 variant
    libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
    libata: support AHCI on OCTEON platform

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "Here are the core block changes for this merge window. Not a lot of
    exciting stuff going on in this round, most of the changes have been
    on the driver side of things. That pull request is coming next. This
    pull request contains:

    - A set of fixes for chained bio handling from Christoph.

    - A tag bounds check for blk-mq from Hannes, ensuring that we don't
    do something stupid if a device reports an invalid tag value.

    - A set of fixes/updates for the CFQ IO scheduler from Jan Kara.

    - A set of blk-mq fixes from Keith, adding support for dynamic
    hardware queues, and fixing init of max_dev_sectors for stacking
    devices.

    - A fix for the dynamic hw context from Ming.

    - Enabling of cgroup writeback support on a block device, from
    Shaohua"

    * 'for-4.6/core' of git://git.kernel.dk/linux-block:
    blk-mq: add bounds check on tag-to-rq conversion
    block: bio_remaining_done() isn't unlikely
    block: cleanup bio_endio
    block: factor out chained bio completion
    block: don't unecessarily clobber bi_error for chained bios
    block-dev: enable writeback cgroup support
    blk-mq: Fix NULL pointer updating nr_requests
    blk-mq: mark request queue as mq asap
    block: Initialize max_dev_sectors to 0
    blk-mq: dynamic h/w context count
    cfq-iosched: Allow parent cgroup to preempt its child
    cfq-iosched: Allow sync noidle workloads to preempt each other
    cfq-iosched: Reorder checks in cfq_should_preempt()
    cfq-iosched: Don't group_idle if cfqq has big thinktime

    Linus Torvalds
     

17 Mar, 2016

1 commit

  • Pull device mapper updates from Mike Snitzer:

    - Most attention this cycle went to optimizing blk-mq request-based DM
    (dm-mq) that is used exclusively by DM multipath:

    - A stable fix for dm-mq that eliminates excessive context
    switching offers the biggest performance improvement (for both
    IOPs and throughput).

    - But more work is needed, during the next cycle, to reduce
    spinlock contention in DM multipath on large NUMA systems.

    - A stable fix for a NULL pointer seen when DM stats is enabled on a DM
    multipath device that must requeue an IO due to path failure.

    - A stable fix for DM snapshot to disallow the COW and origin devices
    from being identical. This amounts to graceful failure in the face
    of userspace error because these devices shouldn't ever be identical.

    - Stable fixes for DM cache and DM thin provisioning to address crashes
    seen if/when their respective metadata device experiences failures
    that cause the transition to 'fail_io' mode.

    - The DM cache 'mq' policy is now an alias for the 'smq' policy. The
    'smq' policy proved to be consistently better than 'mq'. As such
    'mq', with all its complex user-facing tunables, has been eliminated.

    - Improve DM thin provisioning to consistently return -ENOSPC once the
    thin-pool's data volume is out of space.

    - Improve DM core to properly handle error propagation if
    bio_integrity_clone() fails in clone_bio().

    - Other small cleanups and improvements to DM core.

    * tag 'dm-4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (41 commits)
    dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()
    dm thin: consistently return -ENOSPC if pool has run out of data space
    dm cache: bump the target version
    dm cache: make sure every metadata function checks fail_io
    dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIO
    dm cache policy smq: clarify that mq registration failure was for 'mq'
    dm: return error if bio_integrity_clone() fails in clone_bio()
    dm thin metadata: don't issue prefetches if a transaction abort has failed
    dm snapshot: disallow the COW and origin devices from being identical
    dm cache: make the 'mq' policy an alias for 'smq'
    dm: drop unnecessary assignment of md->queue
    dm: reorder 'struct mapped_device' members to fix alignment and holes
    dm: remove dummy definition of 'struct dm_table'
    dm: add 'dm_numa_node' module parameter
    dm thin metadata: remove needless newline from subtree_dec() DMERR message
    dm mpath: cleanup reinstate_path() et al based on code review
    dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy
    dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate
    dm round robin: use percpu 'repeat_count' and 'current_path'
    dm path selector: remove 'repeat_count' return from .select_path hook
    ...

    Linus Torvalds
     

16 Mar, 2016

2 commits

    This patch has been carried in the Android tree for quite some time
    and is one of the few patches required to get a mainline kernel up
    and running with an existing Android userspace. So I wanted to
    submit it for review and consideration of whether it should be
    merged.

    For partitions, add two new uevent parameters: 'PARTN', which
    specifies the partition's index in the table, and 'PARTNAME', which
    specifies the partition name of a partition device.

    Android's userspace uses this for creating device node links from the
    partition name and number, ie:

    /dev/block/platform/soc/by-name/system
    or
    /dev/block/platform/soc/by-num/p1

    One can see its usage here:
    https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
    and
    https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494

    [john.stultz@linaro.org: dropped NPARTS and reworded commit message for context]
    Signed-off-by: Dima Zavin
    Signed-off-by: John Stultz
    Cc: Jens Axboe
    Cc: Rom Lemarchand
    Cc: Android Kernel Team
    Cc: Jeff Moyer
    Cc:
    Cc: Kees Cook
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    San Mehat
     
  • We need to check for a valid index before accessing the array
    element to avoid accessing invalid memory regions.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jeff Moyer

    Modified by Jens to drop the unlikely(), and to make the
    fall-through path the one with a valid tag.

    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

14 Mar, 2016

4 commits


04 Mar, 2016

3 commits

  • A h/w context's tags are freed if it was not assigned a CPU. Check if
    the context has tags before updating the depth.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
    This patch adds support for larger requests in blk_rq_map_user_iov
    by allowing it to build multiple bios for a request. This
    functionality used to exist for the non-vectored blk_rq_map_user in
    the past, and this patch reuses the existing functionality for it on
    the unmap side, which stuck around. Thanks to the iov_iter API,
    supporting multiple bios is fairly trivial: we can just iterate the
    iov until we've consumed the whole iov_iter.

    Signed-off-by: Christoph Hellwig
    Reported-by: Jeff Lien
    Tested-by: Jeff Lien
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This patch applies the two introduced helpers to
    figure out the 1st and last bvec.

    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Feb, 2016

1 commit

    The recent *sync enabling discovered that we are inserting pages
    into the block_device pagecache counter to the expectations of the
    dirty data tracking for dax mappings. This can lead to data
    corruption.

    We want to support DAX for block devices eventually, but it requires
    wider changes to properly manage the pagecache.

    dump_stack+0x85/0xc2
    dax_writeback_mapping_range+0x60/0xe0
    blkdev_writepages+0x3f/0x50
    do_writepages+0x21/0x30
    __filemap_fdatawrite_range+0xc6/0x100
    filemap_write_and_wait+0x4a/0xa0
    set_blocksize+0x70/0xd0
    sb_set_blocksize+0x1d/0x50
    ext4_fill_super+0x75b/0x3360
    mount_bdev+0x180/0x1b0
    ext4_mount+0x15/0x20
    mount_fs+0x38/0x170

    Mark the support broken so it's disabled by default, but otherwise
    still available for testing.

    Signed-off-by: Dan Williams
    Signed-off-by: Ross Zwisler
    Reported-by: Ross Zwisler
    Suggested-by: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

23 Feb, 2016

1 commit

  • Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
    than if an underlying null_blk device were used directly. One of the
    reasons for this drop in performance is that blk_insert_clone_request()
    was calling blk_mq_insert_request() with @async=true. This forced the
    use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
    which ushered in ping-ponging between process context (fio in this case)
    and kblockd's kworker to submit the cloned request. The ftrace
    function_graph tracer showed:

    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...
    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...

    Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
    _not_ use kblockd to submit the cloned requests isn't enough to
    eliminate the observed context switches.

    In addition to this dm-mq specific blk-core fix, there are 2 DM core
    fixes to dm-mq that (when paired with the blk-core fix) completely
    eliminate the observed context switching:

    1) don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by the desire to reduce the overhead of dm-mq; punting to
    kblockd just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this). So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e16ca2d !

    2) use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request). Using blk_mq_complete_request() for
    blk-mq requests is important for performance. It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

    dm-mq fix #2 is _much_ more important than #1 for eliminating the
    context switches.
    Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
    After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

    With these changes multithreaded async read IOPs improved from ~950K
    to ~1350K for this dm-mq stacked on null_blk test-case. The raw read
    IOPs of the underlying null_blk device for the same workload is ~1950K.

    Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
    Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
    Cc: stable@vger.kernel.org # 4.1+
    Reported-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe

    Mike Snitzer
     

19 Feb, 2016

1 commit

    If a block device is left runtime-suspended during system suspend,
    the driver's resume hook typically corrects the runtime PM status of
    the device back to "active" after it is resumed. However, this is
    not enough, as the queue's runtime PM status is still "suspended".
    As long as it is in this state, blk_pm_peek_request() returns NULL
    and thus prevents new requests from being processed.

    Add new function blk_set_runtime_active() that can be used to force the
    queue status back to "active" as needed.

    Signed-off-by: Mika Westerberg
    Acked-by: Jens Axboe
    Signed-off-by: Tejun Heo

    Mika Westerberg
     

18 Feb, 2016

2 commits

  • Pull block fixes from Jens Axboe:
    "A collection of fixes from the past few weeks that should go into 4.5.
    This contains:

    - Overflow fix for sysfs discard show function from Alan.

    - A stacking limit init fix for max_dev_sectors, so we don't end up
    artificially capping some use cases. From Keith.

    - Have blk-mq properly end unstarted requests on a dying queue,
    instead of pushing that to the driver. From Keith.

    - NVMe:
    - Update to Kconfig description for NVME_SCSI, since it was
    vague and having it on is important for some SUSE distros.
    From Christoph.
    - Set of fixes from Keith, around surprise removal. Also kills
    the no-merge flag, so it supports merging.

    - Set of fixes for lightnvm from Matias, Javier, and Wenwei.

    - Fix null_blk oops when asked for lightnvm, but not available. From
    Matias.

    - Copy-to-user EINTR fix from Hannes, fixing a case where SG_IO fails
    if interrupted by a signal.

    - Two floppy fixes from Jiri, fixing signal handling and blocking
    open.

    - A use-after-free fix for O_DIRECT, from Mike Krinkin.

    - A block module ref count fix from Roman Pen.

    - An fs IO wait accounting fix for O_DSYNC from Stephane Gasparini.

    - Smaller realloc fix for xen-blkfront from Bob Liu.

    - Removal of an unused struct member in the deadline IO scheduler,
    from Tahsin.

    - Also from Tahsin, properly initialize inode struct members
    associated with cgroup writeback, if enabled.

    - From Tejun, ensure that we keep the superblock pinned during cgroup
    writeback"

    * 'for-linus' of git://git.kernel.dk/linux-block: (25 commits)
    blk: fix overflow in queue_discard_max_hw_show
    writeback: initialize inode members that track writeback history
    writeback: keep superblock pinned during cgroup writeback association switches
    bio: return EINTR if copying to user space got interrupted
    NVMe: Rate limit nvme IO warnings
    NVMe: Poll device while still active during remove
    NVMe: Requeue requests on suspended queues
    NVMe: Allow request merges
    NVMe: Fix io incapable return values
    blk-mq: End unstarted requests on dying queue
    block: Initialize max_dev_sectors to 0
    null_blk: oops when initializing without lightnvm
    block: fix module reference leak on put_disk() call for cgroups throttle
    nvme: fix Kconfig description for BLK_DEV_NVME_SCSI
    kernel/fs: fix I/O wait not accounted for RW O_DSYNC
    floppy: refactor open() flags handling
    lightnvm: allow to force mm initialization
    lightnvm: check overflow and correct mlc pairs
    lightnvm: fix request intersection locking in rrpc
    lightnvm: warn if irqs are disabled in lock laddr
    ...

    Linus Torvalds
     
  • We get this right for queue_discard_max_show but not max_hw_show. Follow the
    same pattern as queue_discard_max_show instead so that we don't truncate.

    Signed-off-by: Alan Cox
    Signed-off-by: Jens Axboe

    Alan
     

15 Feb, 2016

1 commit

    Currently q->mq_ops is widely used to decide whether the queue is mq
    or not, so we should set the 'flag' asap so that both the block core
    and drivers can get the correct mq info.

    For example, commit 868f2f0b720 (blk-mq: dynamic h/w context count)
    moves the hctx's initialization before setting q->mq_ops in
    blk_mq_init_allocated_queue(), which causes blk_alloc_flush_queue()
    to think the queue is non-mq and not allocate the command size for
    the per-hctx flush rq.

    This patch should fix the problem reported by Sasha.

    Cc: Keith Busch
    Reported-by: Sasha Levin
    Signed-off-by: Ming Lei
    Fixes: 868f2f0b720 ("blk-mq: dynamic h/w context count")
    Signed-off-by: Jens Axboe

    Ming Lei
     

13 Feb, 2016

1 commit

  • Pull SCSI fixes from James Bottomley:
    "A set of seven fixes:

    Two regressions in the new hisi_sas arm driver, a blacklist entry for
    the marvell console which was causing a reset cascade without it, a
    race fix in the WRITE_SAME/DISCARD routines, a retry fix for the rdac
    driver, without which, it would prematurely return EIO and a couple of
    fixes for the hyper-v storvsc driver"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    block/sd: Return -EREMOTEIO when WRITE SAME and DISCARD are disabled
    SCSI: Add Marvell Console to VPD blacklist
    scsi_dh_rdac: always retry MODE SELECT on command lock violation
    storvsc: Use the specified target ID in device lookup
    storvsc: Install the storvsc specific timeout handler for FC devices
    hisi_sas: fix v1 hw check for slot error
    hisi_sas: add dependency for HAS_IOMEM

    Linus Torvalds
     

12 Feb, 2016

3 commits

    Commit 35dc248383bbab0a7203fca4d722875bc81ef091 introduced a check
    for current->mm to see if we have a user space context, and only
    copies data if we do. Now, if an IO gets interrupted by a signal,
    data isn't copied into user space any more (as we don't have a user
    space context), but user space isn't notified about it.

    This patch modifies the behaviour to return -EINTR from bio_uncopy_user()
    to notify userland that a signal has interrupted the syscall, otherwise
    it could lead to a situation where the caller may get a buffer with
    no data returned.

    This can be reproduced by issuing SG_IO ioctl()s in one thread while
    constantly sending signals to it.

    Fixes: 35dc248 ("[SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal")
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Hannes Reinecke
    Cc: stable@vger.kernel.org # v.3.11+
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     
    Go directly to ending a request if it wasn't started. Previously,
    completing a request could invoke a driver callback for a request
    the driver didn't initialize.

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Johannes Thumshirn
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • The new queue limit is not used by the majority of block drivers, and
    should be initialized to 0 for the driver's requested settings to be used.

    Signed-off-by: Keith Busch
    Acked-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

11 Feb, 2016

1 commit

  • The new queue limit is not used by the majority of block drivers, and
    should be initialized to 0 for the driver's requested settings to be used.

    Signed-off-by: Keith Busch
    Acked-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

10 Feb, 2016

3 commits

  • The hardware's provided queue count may change at runtime with resource
    provisioning. This patch allows a block driver to alter the number of
    h/w queues available when its resource count changes.

    The main part is a new blk-mq API to request a new number of h/w queues
    for a given live tag set. The new API freezes all queues using that set,
    then adjusts the allocated count prior to remapping these to CPUs.

    The bulk of the rest just shifts where h/w contexts and all their
    artifacts are allocated and freed.

    The maximum number of h/w contexts is capped to the number of
    possible cpus, since there is no use for more than that. As such,
    all pre-allocated memory for pointers needs to account for the max
    possible rather than the initial number of queues.

    A side effect of this is that blk-mq will proceed successfully as
    long as it can allocate at least one h/w context. Previously it
    would fail request queue initialization if fewer than the requested
    number were allocated.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Tested-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Keith Busch
     
    get_disk()/get_gendisk() calls have a non-explicit side effect: they
    increase the reference count on the disk owner module.

    The following is the correct sequence for getting a disk reference
    and putting it:

    disk = get_gendisk(...);

    /* use disk */

    owner = disk->fops->owner;
    put_disk(disk);
    module_put(owner);

    fs/block_dev.c is aware of this required module_put() call, but
    e.g. blkg_conf_finish(), which is located in block/blk-cgroup.c,
    does not put a module reference. To see the leak in action, the
    cgroups throttle config can be used. In the following script I'm
    removing the throttle for /dev/ram0 (actually this is a NOP, because
    the throttle was never set for this device):

    # lsmod | grep brd
    brd 5175 0
    # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
    /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
    done
    # lsmod | grep brd
    brd 5175 100

    Now the brd module has 100 references.

    The issue is fixed by calling module_put() right after put_disk().

    Signed-off-by: Roman Pen
    Cc: Gi-Oh Kim
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Jens Axboe

    Roman Pen
     
    When a process is doing random writes with the O_DSYNC flag, the
    I/O wait is not accounted in the kernel (get_cpu_iowait_time_us).
    This prevents the governor or the cpufreq driver from accounting
    for I/O wait and thus from using the right P-state.

    Signed-off-by: Stephane Gasparini
    Signed-off-by: Philippe Longepe
    Signed-off-by: Jens Axboe

    Stephane Gasparini
     

05 Feb, 2016

6 commits

  • James Bottomley
     
  • When a storage device rejects a WRITE SAME command we will disable write
    same functionality for the device and return -EREMOTEIO to the block
    layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
    failing the path.

    Yiwen Jiang discovered a small race where WRITE SAME requests issued
    simultaneously would cause -EIO to be returned. This happened because
    any requests being prepared after WRITE SAME had been disabled for the
    device caused us to return BLKPREP_KILL. The latter caused the block
    layer to return -EIO upon completion.

    To overcome this we introduce BLKPREP_INVALID which indicates that this
    is an invalid request for the device. blk_peek_request() is modified to
    return -EREMOTEIO in that case.

    Reported-by: Yiwen Jiang
    Suggested-by: Mike Snitzer
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ewan Milne
    Reviewed-by: Yiwen Jiang
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
     
    Currently we don't allow the sync workload of one cgroup to preempt
    the sync workload of any other cgroup. This is because we want to
    achieve service separation between cgroups. However, in cases where
    the preempting cgroup is an ancestor of the current cgroup, there is
    no need for separation, and idling introduces unnecessary overhead.
    This hurts, for example, the case where a workload is isolated
    within a cgroup but journalling threads are in the root cgroup. A
    simple way to demonstrate the issue is using:

    dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1

    on ext4 filesystem on plain SATA drive (mounted with barrier=0 to make
    difference more visible). When all processes are in the root cgroup,
    reported throughput is 153.132 MB/sec. When dbench process gets its own
    blkio cgroup, reported throughput drops to 26.1006 MB/sec.

    Fix the problem by making the check in cfq_should_preempt() more
    benevolent and allowing preemption by an ancestor cgroup. This
    improves the throughput reported by dbench4 to 48.9106 MB/sec.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • The original idea with preemption of sync noidle queues (introduced in
    commit 718eee0579b8 "cfq-iosched: fairness for sync no-idle queues") was
    that we service all sync noidle queues together, we don't idle on any of
    the queues individually and we idle only if there is no sync noidle
    queue to be served. This intention also matches the original test:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
    && new_cfqq->service_tree == cfqq->service_tree)
    return true;

    However since at that time cfqq->service_tree was not set for idling
    queues, this test was unreliable and was replaced in commit e4a229196a7c
    "cfq-iosched: fix no-idle preemption logic" by:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
    new_cfqq->service_tree->count == 1)
    return true;

    That was a reliable test but was actually doing something different -
    now we preempt sync noidle queue only if the new queue is the only one
    busy in the service tree.

    These days cfq queue is kept in service tree even if it is idling and
    thus the original check would be safe again. But since we actually check
    that cfq queues are in the same cgroup, of the same priority class and
    workload type (sync noidle), we know that new_cfqq is fine to preempt
    cfqq. So just remove the service tree check.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Move check for preemption by rt class up. There is no functional change
    but it makes arguing about conditions simpler since we can be sure both
    cfq queues are from the same ioprio class.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
    There is no point in idling on a cfq group if the only cfq queue
    that is there has too big a thinktime.

    Signed-off-by: Jan Kara
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Feb, 2016

2 commits

  • Pull libnvdimm fixes from Dan Williams:
    "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved
    area for storing a struct page array.

    2/ Fixes for dax operations on a raw block device to prevent pagecache
    collisions with dax mappings.

    3/ A fix for pfn_t usage in vm_insert_mixed that lead to a null
    pointer de-reference.

    These have received build success notification from the kbuild robot
    across 153 configs and pass the latest ndctl tests"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    phys_to_pfn_t: use phys_addr_t
    mm: fix pfn_t to page conversion in vm_insert_mixed
    block: use DAX for partition table reads
    block: revert runtime dax control of the raw block device
    fs, block: force direct-I/O for dax-enabled block devices
    devm_memremap_pages: fix vmem_altmap lifetime + alignment handling
    libnvdimm, pfn: fix restoring memmap location
    libnvdimm: fix mode determination for e820 devices

    Linus Torvalds
     
  • commit 63de428b139d3d31d86ebe25ae97b33f6540fb7e ("deadline-iosched:
    allow non-sequential batching") removed last use of last_sector.

    Signed-off-by: Tahsin Erdogan
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tahsin Erdogan