13 May, 2017

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

11 May, 2017

1 commit


10 May, 2017

5 commits

  • When formatting NVMe to 512B/4K + T10 DIF/DIX, dd with a split op returns
    "Input/output error". It looks like the block layer splits the bio after
    calling bio_integrity_prep(bio). This patch fixes the issue.

    Below is how we debugged this issue:
    (1) Format the NVMe device to a 4K block size with type 2 DIF.
    (2) Run dd with a block size bigger than 1024k and oflag=direct:
    dd: error writing '/dev/nvme0n1': Input/output error

    We added some debug code to the NVMe device driver. It showed that the
    first op and the second op have the same bi and pi addresses, which is
    not correct.

    1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
    dsmgmt=0x0, AT=0x0 & RT=0x505
    Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828

    2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
    AT=0x0 & RT=0x605 ==> This op fails, as do the subsequent 5 retries.
    Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828

    With the fix, the debug output shows that both the first op and the
    second op have the correct bi and pi addresses.

    1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
    dsmgmt=0x0, AT=0x0 & RT=0x505
    Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828

    2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
    AT=0x0 & RT=0x605
    Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual 0x00003028
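
    The ref tags in the traces follow a simple pattern, which the Python
    sketch below reproduces. It is an illustration inferred from the numbers
    above, not the kernel code: the helper name split_ref_tags is
    hypothetical, and the idea that the "RT virtual" value counts 512-byte
    units (8 per 4K block) is an assumption that happens to fit the traces.
    The point is that each fragment of a split write should start its
    protection information where the previous fragment ended.

```python
SECTOR_SIZE = 512  # assumed unit of the "RT virtual" value in the traces


def split_ref_tags(start_lba, fragment_lengths, block_size=4096):
    """Return one (physical, virtual) ref tag pair per split fragment.

    start_lba and fragment_lengths are in device blocks (the slba and
    length fields of the traces). Hypothetical helper for illustration.
    """
    units_per_block = block_size // SECTOR_SIZE
    tags = []
    lba = start_lba
    for length in fragment_lengths:
        tags.append((lba, lba * units_per_block))
        lba += length  # the next fragment starts where this one ends
    return tags


for physical, virtual in split_ref_tags(0x505, [0x100, 0x1]):
    print(f"RT physical 0x{physical:08x} RT virtual 0x{virtual:08x}")
```

    With the values from the traces this yields 0x00002828 and 0x00003028
    for the two fragments, matching the fixed output; the broken output
    reported 0x00002828 for both.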

    Signed-off-by: Wen Xiong
    Signed-off-by: Jens Axboe

    Wen Xiong
     
  • If PREEMPT_RCU is enabled, rcu_read_lock() isn't strong enough
    for us to use this_cpu_ptr() in that section. Use the safer
    get/put_cpu_ptr() variants instead.

    Reported-by: Mike Galbraith
    Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We warn twice when a switch to a scheduler fails. As we also report
    the failure in the return value of the sysfs write, remove the
    failure messages printed to dmesg.

    Keep the failure print for the switch to the kconfig selected IO
    scheduler, as we can't report errors for that in any other way.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The introduction of the BFQ and Kyber I/O schedulers has triggered a
    new wave of I/O benchmarks. Unfortunately, comments and discussions on
    these benchmarks confirm that there is still little awareness that it
    is very hard to achieve, at the same time, a low latency and a high
    throughput. In particular, virtually all benchmarks measure
    throughput, or throughput-related figures of merit, but, for BFQ, they
    use the scheduler in its default configuration. This configuration is
    geared, instead, toward a low latency. This is evidently a sign that
    BFQ documentation is still too unclear on this important aspect. This
    commit addresses this issue by stressing how BFQ configuration must be
    (easily) changed if the only goal is maximum throughput.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In the function __bfq_deactivate_entity, the pointer
    entity->sched_data could happen to be used before being properly
    initialized. This led to a NULL pointer dereference. This commit fixes
    this bug by just using this pointer only where it is safe to do so.

    Reported-by: Tom Harrison
    Tested-by: Tom Harrison
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 May, 2017

1 commit

  • For configurations that do not enable DAX filesystems or drivers, do not
    require the DAX core to be built.

    Given that the 'direct_access' method has been removed from
    'block_device_operations', we can also go ahead and move the
    block-related dax helper functions from fs/block_dev.c to
    drivers/dax/super.c. This keeps dax details out of the block layer and
    lets the DAX core be built as a module in the FS_DAX=n case.

    Filesystems need to include dax.h to call bdev_dax_supported().

    Cc: linux-xfs@vger.kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Dan Williams
     

08 May, 2017

2 commits

  • Making __blk_mq_stop_hw_queues static fixes sparse warning:

    block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
    declared. Should it be static?

    Fixes: 2719aa217e0d0 ("blk-mq: don't use sync workqueue flushing from drivers")
    Signed-off-by: Colin Ian King
    Signed-off-by: Jens Axboe

    Colin Ian King
     
  • This can be triggered by hot-unplugging one CPU.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.11.0+ #17 Not tainted
    -------------------------------------------------------
    step_after_susp/2640 is trying to acquire lock:
    (all_q_mutex){+.+...}, at: [] blk_mq_queue_reinit_work+0x18/0x110

    but task is already holding lock:
    (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x7f/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    lock_acquire+0x11c/0x230
    __mutex_lock+0x92/0x990
    mutex_lock_nested+0x1b/0x20
    get_online_cpus+0x64/0x80
    blk_mq_init_allocated_queue+0x3a0/0x4e0
    blk_mq_init_queue+0x3a/0x60
    loop_add+0xe5/0x280
    loop_init+0x124/0x177
    do_one_initcall+0x53/0x1c0
    kernel_init_freeable+0x1e3/0x27f
    kernel_init+0xe/0x100
    ret_from_fork+0x31/0x40

    -> #0 (all_q_mutex){+.+...}:
    __lock_acquire+0x189a/0x18a0
    lock_acquire+0x11c/0x230
    __mutex_lock+0x92/0x990
    mutex_lock_nested+0x1b/0x20
    blk_mq_queue_reinit_work+0x18/0x110
    blk_mq_queue_reinit_dead+0x1c/0x20
    cpuhp_invoke_callback+0x1f2/0x810
    cpuhp_down_callbacks+0x42/0x80
    _cpu_down+0xb2/0xe0
    freeze_secondary_cpus+0xb6/0x390
    suspend_devices_and_enter+0x3b3/0xa40
    pm_suspend+0x129/0x490
    state_store+0x82/0xf0
    kobj_attr_store+0xf/0x20
    sysfs_kf_write+0x45/0x60
    kernfs_fop_write+0x135/0x1c0
    __vfs_write+0x37/0x160
    vfs_write+0xcd/0x1d0
    SyS_write+0x58/0xc0
    do_syscall_64+0x8f/0x710
    return_from_SYSCALL_64+0x0/0x7a

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(cpu_hotplug.lock);
    lock(all_q_mutex);
    lock(cpu_hotplug.lock);
    lock(all_q_mutex);

    *** DEADLOCK ***

    8 locks held by step_after_susp/2640:
    #0: (sb_writers#6){.+.+.+}, at: [] vfs_write+0x1ad/0x1d0
    #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0x101/0x1c0
    #2: (s_active#166){.+.+.+}, at: [] kernfs_fop_write+0x109/0x1c0
    #3: (pm_mutex){+.+...}, at: [] pm_suspend+0x21d/0x490
    #4: (acpi_scan_lock){+.+.+.}, at: [] acpi_scan_lock_acquire+0x17/0x20
    #5: (cpu_add_remove_lock){+.+.+.}, at: [] freeze_secondary_cpus+0x27/0x390
    #6: (cpu_hotplug.dep_map){++++++}, at: [] cpu_hotplug_begin+0x5/0xe0
    #7: (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x7f/0xe0

    stack backtrace:
    CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
    Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
    Call Trace:
    dump_stack+0x99/0xce
    print_circular_bug+0x1fa/0x270
    __lock_acquire+0x189a/0x18a0
    lock_acquire+0x11c/0x230
    ? lock_acquire+0x11c/0x230
    ? blk_mq_queue_reinit_work+0x18/0x110
    ? blk_mq_queue_reinit_work+0x18/0x110
    __mutex_lock+0x92/0x990
    ? blk_mq_queue_reinit_work+0x18/0x110
    ? kmem_cache_free+0x2cb/0x330
    ? anon_transport_class_unregister+0x20/0x20
    ? blk_mq_queue_reinit_work+0x110/0x110
    mutex_lock_nested+0x1b/0x20
    ? mutex_lock_nested+0x1b/0x20
    blk_mq_queue_reinit_work+0x18/0x110
    blk_mq_queue_reinit_dead+0x1c/0x20
    cpuhp_invoke_callback+0x1f2/0x810
    ? __flow_cache_shrink+0x160/0x160
    cpuhp_down_callbacks+0x42/0x80
    _cpu_down+0xb2/0xe0
    freeze_secondary_cpus+0xb6/0x390
    suspend_devices_and_enter+0x3b3/0xa40
    ? rcu_read_lock_sched_held+0x79/0x80
    pm_suspend+0x129/0x490
    state_store+0x82/0xf0
    kobj_attr_store+0xf/0x20
    sysfs_kf_write+0x45/0x60
    kernfs_fop_write+0x135/0x1c0
    __vfs_write+0x37/0x160
    ? rcu_read_lock_sched_held+0x79/0x80
    ? rcu_sync_lockdep_assert+0x2f/0x60
    ? __sb_start_write+0xd9/0x1c0
    ? vfs_write+0x1ad/0x1d0
    vfs_write+0xcd/0x1d0
    SyS_write+0x58/0xc0
    ? rcu_read_lock_sched_held+0x79/0x80
    do_syscall_64+0x8f/0x710
    ? trace_hardirqs_on_thunk+0x1a/0x1c
    entry_SYSCALL64_slow_path+0x25/0x25

    The cpu hotplug path holds cpu_hotplug.lock and then reinits all existing
    queues for blk-mq under all_q_mutex; however, blk_mq_init_allocated_queue()
    takes these two locks in the inverse order. This is due to commit eabe06595d62
    (blk/mq: Cure cpu hotplug lock inversion), which fixes a cpu hotplug lock
    inversion caused by the hotplug rework. However, the hotplug rework is still
    work-in-progress, lives in a -tip branch, and mainline cannot yet trigger that
    splat. The commit breaks Linus's tree in the merge window, so this patch
    reverts the lock order and avoids the splat in Linus's tree.
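
    The kind of check behind the report above can be sketched in a few lines
    of Python. This is a toy model of lock-order tracking, not lockdep
    itself: the class LockOrderGraph and its methods are hypothetical, and
    only the two lock names are taken from the splat. The model records
    "B was taken while A was held" edges and flags an acquisition that
    closes a cycle.

```python
from collections import defaultdict


class LockOrderGraph:
    """Toy model: remember lock ordering and detect inversions."""

    def __init__(self):
        # held lock -> set of locks that were acquired while it was held
        self.edges = defaultdict(set)

    def _reachable(self, src, dst):
        """Depth-first search: is dst reachable from src via recorded edges?"""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, held, new):
        """Record taking `new` while `held` locks are held.

        Returns False if the new ordering contradicts an earlier one.
        """
        consistent = True
        for lock in held:
            if self._reachable(new, lock):
                consistent = False  # `lock` already depends on `new`
            self.edges[lock].add(new)
        return consistent


graph = LockOrderGraph()
# Queue init path: all_q_mutex first, then cpu_hotplug.lock.
init_ok = graph.acquire(["all_q_mutex"], "cpu_hotplug.lock")
# Hotplug path: cpu_hotplug.lock first, then all_q_mutex.
hotplug_ok = graph.acquire(["cpu_hotplug.lock"], "all_q_mutex")
```

    In this model init_ok is True and hotplug_ok is False, which mirrors the
    "possible unsafe locking scenario" table: the two paths take the same
    pair of locks in opposite orders.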

    Cc: Jens Axboe
    Cc: Peter Zijlstra (Intel)
    Cc: Thomas Gleixner
    Signed-off-by: Wanpeng Li
    Signed-off-by: Jens Axboe

    Wanpeng Li
     

07 May, 2017

1 commit

  • Pull block fixes and updates from Jens Axboe:
    "Some fixes and followup features/changes that should go in, in this
    merge window. This contains:

    - Two fixes for lightnvm from Javier, fixing problems in the new code
    merged previously in this merge window.

    - A fix from Jan for the backing device changes, fixing an issue in
    NFS that causes a failure to mount on certain setups.

    - A change from Christoph, cleaning up the blk-mq init and exit
    request paths.

    - Remove elevator_change(), which is now unused. From Bart.

    - A fix for queue operation invocation on a dead queue, from Bart.

    - A series fixing up mtip32xx for blk-mq scheduling, removing a
    bandaid we previously had in place for this. From me.

    - A regression fix for this series, fixing a case where we wait on
    workqueue flushing from an invalid (non-blocking) context. From me.

    - A fix/optimization from Ming, ensuring that we don't both quiesce
    and freeze a queue at the same time.

    - A fix from Peter on lock ordering for CPU hotplug. Not a real
    problem right now, but will be once the CPU hotplug rework goes in.

    - A series from Omar, cleaning up our blk-mq debugfs support, and
    adding support for exporting info from schedulers in debugfs as
    well. This is really useful in debugging stalls or livelocks. From
    Omar"

    * 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
    mq-deadline: add debugfs attributes
    kyber: add debugfs attributes
    blk-mq-debugfs: allow schedulers to register debugfs attributes
    blk-mq: untangle debugfs and sysfs
    blk-mq: move debugfs declarations to a separate header file
    blk-mq: Do not invoke queue operations on a dead queue
    blk-mq-debugfs: get rid of a bunch of boilerplate
    blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
    blk-mq-debugfs: don't open code strstrip()
    blk-mq-debugfs: error on long write to queue "state" file
    blk-mq-debugfs: clean up flag definitions
    blk-mq-debugfs: separate flags with |
    nfs: Fix bdi handling for cloned superblocks
    block/mq: Cure cpu hotplug lock inversion
    lightnvm: fix bad back free on error path
    lightnvm: create cmd before allocating request
    blk-mq: don't use sync workqueue flushing from drivers
    mtip32xx: convert internal commands to regular block infrastructure
    mtip32xx: cleanup internal tag assumptions
    block: don't call blk_mq_quiesce_queue() after queue is frozen
    ...

    Linus Torvalds
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

04 May, 2017

15 commits

  • Expose the fifo lists, cached next requests, batching state, and
    dispatch list. It'd also be possible to add the sorted lists, but there
    aren't already seq_file helpers for rbtrees.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Expose the domain token pools, asynchronous sbitmap depth, domain
    request lists, and batching state.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This provides the infrastructure for schedulers to expose their internal
    state through debugfs. We add a list of queue attributes and a list of
    hctx attributes to struct elevator_type and wire them up when switching
    schedulers.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke

    Add missing seq_file.h header in blk-mq-debugfs.h

    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Originally, I tied debugfs registration/unregistration together with
    sysfs. There's no reason to do this, and it's getting in the way of
    letting schedulers define their own debugfs attributes. Instead, tie the
    debugfs registration to the lifetime of the structures themselves.

    The saner lifetimes mean we can also get rid of the extra mq directory
    and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
    just nvme0n1/hctx0/tags.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Preparation for adding more declarations.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • In commit e869b5462f83 ("blk-mq: Unregister debugfs attributes
    earlier"), we shuffled the debugfs cleanup around so that the "state"
    attribute was removed before we freed the blk-mq data structures.
    However, later changes are going to undo that, so we need to explicitly
    disallow running a dead queue.

    [Omar: rebased and updated commit message]
    Signed-off-by: Omar Sandoval
    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A large part of blk-mq-debugfs.c is file_operations and seq_file
    boilerplate. This sucks as is but will suck even more when schedulers
    can define their own debugfs entries. Factor it all out into a single
    blk_mq_debugfs_fops which multiplexes as needed. We store the
    request_queue, blk_mq_hw_ctx, or blk_mq_ctx in the parent directory
    dentry, which is kind of hacky, but it works.
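
    The idea of one shared set of file operations can be modelled roughly as
    follows. This is a Python sketch, not the blk-mq code: DebugDir,
    debug_read and the attribute tables are hypothetical names, and the
    values stored in the directories are made up for illustration. Each
    directory keeps the object its files describe, and a single read path
    looks up the per-file callback and hands it that object.

```python
class DebugDir:
    """A directory node; `data` stands in for the object kept in the parent dentry."""

    def __init__(self, name, data):
        self.name = name
        self.data = data
        self.attrs = {}

    def register(self, attrs):
        """Wire up a table of (file name, show callback) pairs."""
        for filename, show in attrs:
            self.attrs[filename] = show


def debug_read(directory, filename):
    """Single shared read path for every file in every directory."""
    show = directory.attrs[filename]
    return show(directory.data)


# Illustrative tables: one for a queue, one for a hardware context.
queue_attrs = [("state", lambda q: q["state"])]
hctx_attrs = [("tags", lambda h: f"nr_tags={h['nr_tags']}")]

queue_dir = DebugDir("nvme0n1", {"state": "running"})
hctx_dir = DebugDir("hctx0", {"nr_tags": 64})
queue_dir.register(queue_attrs)
hctx_dir.register(hctx_attrs)
```

    Adding a new file then only needs one more (name, callback) entry in a
    table, which is also what would let a scheduler supply its own entries.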

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • It's not clear what these numbered directories represent unless you
    consult the code. We're about to get rid of the intermediate "mq"
    directory, so these would be even more confusing without that context.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Slightly more readable, plus we also strip leading spaces.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • blk_queue_flags_store() currently truncates and returns a short write if
    the operation being written is too long. This can give us weird results,
    like here:

    $ echo "run bar"
    echo: write error: invalid argument
    $ dmesg
    [ 1103.075435] blk_queue_flags_store: unsupported operation bar. Use either 'run' or 'start'

    Instead, return an error if the user does this. While we're here, make
    the argument names consistent with everywhere else in this file.
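
    The difference between truncating and rejecting can be modelled in
    Python as below. This is a sketch under stated assumptions, not the
    kernel handler: OP_MAX, store_truncating, store_strict and write_all are
    hypothetical, and only the operation names 'run' and 'start' come from
    the dmesg line above. The writer loop mimics a caller that retries the
    unwritten tail after a short write.

```python
import errno

OP_MAX = 15                    # hypothetical capacity of the handler's buffer
VALID_OPS = {"run", "start"}   # operations named in the dmesg line above


def store_truncating(data):
    """Old behaviour (modelled): silently cut the input, report a short write."""
    chunk = data[:OP_MAX]
    if chunk.strip() not in VALID_OPS:
        return -errno.EINVAL
    return len(chunk)


def store_strict(data):
    """New behaviour (modelled): refuse input that does not fit."""
    if len(data) > OP_MAX:
        return -errno.EINVAL
    if data.strip() not in VALID_OPS:
        return -errno.EINVAL
    return len(data)


def write_all(store, data):
    """Mimic a writer that retries the unwritten tail after a short write."""
    results = []
    while data:
        written = store(data)
        results.append(written)
        if written < 0:
            break
        data = data[written:]
    return results


long_write = "run" + " " * 12 + "bar"   # 18 characters, longer than OP_MAX
old_results = write_all(store_truncating, long_write)
new_results = write_all(store_strict, long_write)
```

    In the truncating model the first 15 characters are accepted as "run"
    and only the leftover "bar" is rejected, so an operation has already
    been performed by the time the error appears. The strict model fails
    the whole write up front.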

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Make sure the spelled out flag names match the definition. This also
    adds a missing hctx state, BLK_MQ_S_START_ON_RUN, and a missing
    cmd_flag, __REQ_NOUNMAP.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This reads more naturally than spaces.
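
    As a rough illustration of the output style, the Python sketch below
    renders a bitmask as names joined by '|'. The function format_flags is
    hypothetical, and the bit positions in HCTX_STATE_NAMES are invented for
    the example; they are not the kernel's definitions.

```python
def format_flags(flags, names):
    """Render the set bits of `flags` as names joined by '|'.

    names[i] is the label for bit i; bits without a label are shown as
    their bit number so nothing is silently dropped.
    """
    if flags < 0:
        raise ValueError("flags must be non-negative")
    parts = []
    bit = 0
    while flags >> bit:
        if flags & (1 << bit):
            known = bit < len(names) and names[bit]
            parts.append(names[bit] if known else str(bit))
        bit += 1
    return "|".join(parts)


# Hypothetical bit order, for illustration only.
HCTX_STATE_NAMES = ["STOPPED", "START_ON_RUN"]

print(format_flags(0b11, HCTX_STATE_NAMES))
```

    A '|' separator mirrors how such flags are combined in C source, so the
    debugfs output can be compared against the code directly.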

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • By poking at /debug/sched_features I triggered the following splat:

    [] ======================================================
    [] WARNING: possible circular locking dependency detected
    [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
    [] ------------------------------------------------------
    [] bash/2109 is trying to acquire lock:
    [] (cpu_hotplug_lock.rw_sem){++++++}, at: [] static_key_slow_dec+0x1b/0x50
    []
    [] but task is already holding lock:
    [] (&sb->s_type->i_mutex_key#4){+++++.}, at: [] sched_feat_write+0x86/0x170
    []
    [] which lock already depends on the new lock.
    []
    []
    [] the existing dependency chain (in reverse order) is:
    []
    [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
    [] lock_acquire+0x100/0x210
    [] down_write+0x28/0x60
    [] start_creating+0x5e/0xf0
    [] debugfs_create_dir+0x13/0x110
    [] blk_mq_debugfs_register+0x21/0x70
    [] blk_mq_register_dev+0x64/0xd0
    [] blk_register_queue+0x6a/0x170
    [] device_add_disk+0x22d/0x440
    [] loop_add+0x1f3/0x280
    [] loop_init+0x104/0x142
    [] do_one_initcall+0x43/0x180
    [] kernel_init_freeable+0x1de/0x266
    [] kernel_init+0xe/0x100
    [] ret_from_fork+0x31/0x40
    []
    [] -> #1 (all_q_mutex){+.+.+.}:
    [] lock_acquire+0x100/0x210
    [] __mutex_lock+0x6c/0x960
    [] mutex_lock_nested+0x1b/0x20
    [] blk_mq_init_allocated_queue+0x37c/0x4e0
    [] blk_mq_init_queue+0x3a/0x60
    [] loop_add+0xe5/0x280
    [] loop_init+0x104/0x142
    [] do_one_initcall+0x43/0x180
    [] kernel_init_freeable+0x1de/0x266
    [] kernel_init+0xe/0x100
    [] ret_from_fork+0x31/0x40

    [] *** DEADLOCK ***
    []
    [] 3 locks held by bash/2109:
    [] #0: (sb_writers#11){.+.+.+}, at: [] vfs_write+0x17d/0x1a0
    [] #1: (debugfs_srcu){......}, at: [] full_proxy_write+0x5d/0xd0
    [] #2: (&sb->s_type->i_mutex_key#4){+++++.}, at: [] sched_feat_write+0x86/0x170
    []
    [] stack backtrace:
    [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
    [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
    [] Call Trace:

    [] lock_acquire+0x100/0x210
    [] get_online_cpus+0x2a/0x90
    [] static_key_slow_dec+0x1b/0x50
    [] static_key_disable+0x20/0x30
    [] sched_feat_write+0x131/0x170
    [] full_proxy_write+0x97/0xd0
    [] __vfs_write+0x28/0x120
    [] vfs_write+0xb5/0x1a0
    [] SyS_write+0x49/0xa0
    [] entry_SYSCALL_64_fastpath+0x23/0xc2

    This is because of the cpu hotplug lock rework. Break the chain at #1
    by reversing the lock acquisition order. This way i_mutex_key#4 no
    longer depends on cpu_hotplug_lock and things are good.

    Cc: Jens Axboe
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Peter Zijlstra
     
  • A previous commit introduced the sync flush, which we need from
    internal callers like blk_mq_quiesce_queue(). However, we also
    call the stop helpers from drivers, particularly from ->queue_rq()
    when we have to stop processing for a bit. We can't block from
    those locations, and we don't have to guarantee that we're
    fully flushed.

    Fixes: 9f993737906b ("blk-mq: unify hctx delayed_run_work and run_work")
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pull MD updates from Shaohua Li:

    - Add Partial Parity Log (ppl) feature found in Intel IMSM raid array
    by Artur Paszkiewicz. This feature is another way to close RAID5
    writehole. The Linux implementation is also available for normal
    RAID5 array if specific superblock bit is set.

    - A number of md-cluster fixes and enabling md-cluster array resize from
    Guoqing Jiang

    - A bunch of patches from Ming Lei and Neil Brown to rewrite MD bio
    handling related code. Now MD doesn't directly access bio bvec,
    bi_phys_segments and uses modern bio API for bio split.

    - Improve RAID5 IO pattern to improve performance for hard disk based
    RAID5/6 from me.

    - Several patches from Song Liu to speed up raid5-cache recovery and
    allow raid5 cache feature disabling in runtime.

    - Fix a performance regression in raid1 resync from Xiao Ni.

    - Other cleanup and fixes from various people.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (84 commits)
    md/raid10: skip spare disk as 'first' disk
    md/raid1: Use a new variable to count flighting sync requests
    md: clear WantReplacement once disk is removed
    md/raid1/10: remove unused queue
    md: handle read-only member devices better.
    md/raid10: wait up frozen array in handle_write_completed
    uapi: fix linux/raid/md_p.h userspace compilation error
    md-cluster: Fix a memleak in an error handling path
    md: support disabling of create-on-open semantics.
    md: allow creation of mdNNN arrays via md_mod/parameters/new_array
    raid5-ppl: use a single mempool for ppl_io_unit and header_page
    md/raid0: fix up bio splitting.
    md/linear: improve bio splitting.
    md/raid5: make chunk_aligned_read() split bios more cleanly.
    md/raid10: simplify handle_read_error()
    md/raid10: simplify the splitting of requests.
    md/raid1: factor out flush_bio_list()
    md/raid1: simplify handle_read_error().
    Revert "block: introduce bio_copy_data_partial"
    md/raid1: simplify alloc_behind_master_bio()
    ...

    Linus Torvalds
     

03 May, 2017

2 commits

  • After queue is frozen, no request in this queue can be in use at all, so
    there can't be any .queue_rq() running on this queue. It isn't
    necessary to call blk_mq_quiesce_queue() any more, so remove it in both
    elevator_switch_mq() and blk_mq_update_nr_requests().

    Cc: Bart Van Assche
    Signed-off-by: Ming Lei

    Fixed up the description a bit.

    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Pull documentation update from Jonathan Corbet:
    "A reasonably busy cycle for documentation this time around. There is a
    new guide for user-space API documents, rather sparsely populated at
    the moment, but it's a start. Markus improved the infrastructure for
    converting diagrams. Mauro has converted much of the USB documentation
    over to RST. Plus the usual set of fixes, improvements, and tweaks.

    There's a bit more than the usual amount of reaching out of
    Documentation/ to fix comments elsewhere in the tree; I have acks for
    those where I could get them"

    * tag 'docs-4.12' of git://git.lwn.net/linux: (74 commits)
    docs: Fix a couple typos
    docs: Fix a spelling error in vfio-mediated-device.txt
    docs: Fix a spelling error in ioctl-number.txt
    MAINTAINERS: update file entry for HSI subsystem
    Documentation: allow installing man pages to a user defined directory
    Doc/PM: Sync with intel_powerclamp code behavior
    zr364xx.rst: usb/devices is now at /sys/kernel/debug/
    usb.rst: move documentation from proc_usb_info.txt to USB ReST book
    convert philips.txt to ReST and add to media docs
    docs-rst: usb: update old usbfs-related documentation
    arm: Documentation: update a path name
    docs: process/4.Coding.rst: Fix a couple of document refs
    docs-rst: fix usb cross-references
    usb: gadget.h: be consistent at kernel doc macros
    usb: composite.h: fix two warnings when building docs
    usb: get rid of some ReST doc build errors
    usb.rst: get rid of some Sphinx errors
    usb/URB.txt: convert to ReST and update it
    usb/persist.txt: convert to ReST and add to driver-api book
    usb/hotplug.txt: convert to ReST and add to driver-api book
    ...

    Linus Torvalds
     

02 May, 2017

6 commits

  • Remove the request_idx parameter, which can't be used safely now that we
    support I/O schedulers with blk-mq. Except for a superfluous check in
    mtip32xx it was unused anyway.

    Also pass the tag_set instead of just the driver data - this allows drivers
    to avoid some code duplication in a follow on cleanup.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We have updated the troublesome driver (mtip32xx) to deal with this
    appropriately. So kill the hack that bypassed scheduler allocation
    and insertion for reserved requests.

    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Tested-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since commit 84253394927c ("remove the mg_disk driver") removed the
    only caller of elevator_change(), also remove the elevator_change()
    function itself.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Markus Trippelsdorf
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Pull uaccess unification updates from Al Viro:
    "This is the uaccess unification pile. It's _not_ the end of uaccess
    work, but the next batch of that will go into the next cycle. This one
    mostly takes copy_from_user() and friends out of arch/* and gets the
    zero-padding behaviour in sync for all architectures.

    Dealing with the nocache/writethrough mess is for the next cycle;
    fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
    sold on access_ok() in there, BTW; just not in this pile), same for
    reducing __copy_... callsites, strn*... stuff, etc. - there will be a
    pile about as large as this one in the next merge window.

    This one sat in -next for weeks. -3KLoC"

    * 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
    HAVE_ARCH_HARDENED_USERCOPY is unconditional now
    CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
    m32r: switch to RAW_COPY_USER
    hexagon: switch to RAW_COPY_USER
    microblaze: switch to RAW_COPY_USER
    get rid of padding, switch to RAW_COPY_USER
    ia64: get rid of copy_in_user()
    ia64: sanitize __access_ok()
    ia64: get rid of 'segment' argument of __do_{get,put}_user()
    ia64: get rid of 'segment' argument of __{get,put}_user_check()
    ia64: add extable.h
    powerpc: get rid of zeroing, switch to RAW_COPY_USER
    esas2r: don't open-code memdup_user()
    alpha: fix stack smashing in old_adjtimex(2)
    don't open-code kernel_setsockopt()
    mips: switch to RAW_COPY_USER
    mips: get rid of tail-zeroing in primitives
    mips: make copy_from_user() zero tail explicitly
    mips: clean and reorder the forest of macros...
    mips: consolidate __invoke_... wrappers
    ...

    Linus Torvalds
     
  • Shaohua Li
     
  • Pull block layer updates from Jens Axboe:

    - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
    was initially a fork of CFQ, but subsequently changed to implement
    fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
    to be used on desktop type single drives, providing good fairness.
    From Paolo.

    - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
    using a scalable token based algorithm that throttles IO based on
    live completion IO stats, similarly to blk-wbt. From Omar.

    - A series from Jan, moving users to separately allocated backing
    devices. This continues the work of separating backing device life
    times, solving various problems with hot removal.

    - A series of updates for lightnvm, mostly from Javier. Includes a
    'pblk' target that exposes an open channel SSD as a physical block
    device.

    - A series of fixes and improvements for nbd from Josef.

    - A series from Omar, removing queue sharing between devices on mostly
    legacy drivers. This helps us clean up other bits, if we know that a
    queue only has a single device backing. This has been overdue for
    more than a decade.

    - Fixes for the blk-stats, and improvements to unify the stats and user
    windows. This both improves blk-wbt, and enables other users to
    register a need to receive IO stats for a device. From Omar.

    - blk-throttle improvements from Shaohua. This provides a scalable
    framework for implementing scalable prioritization - particularly for
    blk-mq, but applicable to any type of block device. The interface is
    marked experimental for now.

    - Bucketized IO stats for IO polling from Stephen Bates. This improves
    efficiency of polled workloads in the presence of mixed block size
    IO.

    - A few fixes for opal, from Scott.

    - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
    From a variety of folks, mostly Sagi and James Smart.

    - A series from Bart, improving our exposed info and capabilities from
    the blk-mq debugfs support.

    - A series from Christoph, cleaning up how we handle WRITE_ZEROES.

    - A series from Christoph, cleaning up the block layer handling of how
    we track errors in a request. On top of being a nice cleanup, it also
    shrinks the size of struct request a bit.

    - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
    never used by platforms, and the latter has outlived its usefulness.

    - Various little bug fixes and cleanups from a wide variety of folks.

    * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
    block: hide badblocks attribute by default
    blk-mq: unify hctx delay_work and run_work
    block: add kblock_mod_delayed_work_on()
    blk-mq: unify hctx delayed_run_work and run_work
    nbd: fix use after free on module unload
    MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
    blk-mq-sched: alloate reserved tags out of normal pool
    mtip32xx: use runtime tag to initialize command header
    scsi: Implement blk_mq_ops.show_rq()
    blk-mq: Add blk_mq_ops.show_rq()
    blk-mq: Show operation, cmd_flags and rq_flags names
    blk-mq: Make blk_flags_show() callers append a newline character
    blk-mq: Move the "state" debugfs attribute one level down
    blk-mq: Unregister debugfs attributes earlier
    blk-mq: Only unregister hctxs for which registration succeeded
    blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
    blk-mq: Let blk_mq_debugfs_register() look up the queue name
    blk-mq: Register /queue/mq after having registered /queue
    ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
    ide-pm: always pass 0 error to __blk_end_request_all
    ..

    Linus Torvalds
     

28 Apr, 2017

4 commits

  • Commit 99e6608c9e74 "block: Add badblock management for gendisks"
    allowed for drivers like pmem and software-raid to advertise a list of
    bad media areas. However, it inadvertently added a 'badblocks' attribute
    to all block devices. Let's clean this up by having the 'badblocks' attribute
    not be visible when the driver has not populated a 'struct badblocks'
    instance in the gendisk.
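
    The visibility rule described above can be sketched as a small userspace
    model. This is not the kernel code: struct disk_model, struct attr_model
    and disk_attr_visible() are hypothetical stand-ins, assuming only that an
    attribute is hidden by reporting a mode of 0 and that the disk carries a
    pointer that stays NULL until the driver populates it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model; these types are hypothetical stand-ins, not kernel structs. */
struct badblocks_model {
    int count;
};

struct disk_model {
    struct badblocks_model *bb;   /* NULL when the driver tracks no bad blocks */
};

struct attr_model {
    const char *name;
    unsigned int mode;            /* permission bits, e.g. 0444 */
};

/*
 * Return the mode with which an attribute should be exposed, or 0 to hide it.
 * Only 'badblocks' is conditional; every other attribute keeps its own mode.
 */
static unsigned int disk_attr_visible(const struct disk_model *disk,
                                      const struct attr_model *attr)
{
    assert(disk != NULL && attr != NULL);
    if (strcmp(attr->name, "badblocks") == 0 && disk->bb == NULL)
        return 0;
    return attr->mode;
}
```

    In this model a disk whose driver never set bb reports mode 0 for
    'badblocks', so the attribute is hidden, while unrelated attributes are
    unaffected.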

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    Reported-by: Vishal Verma
    Signed-off-by: Dan Williams
    Tested-by: Vishal Verma
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • The only difference between ->run_work and ->delay_work, is that
    the latter is used to defer running a queue. This is done by
    marking the queue stopped, and scheduling ->delay_work to run
    sometime in the future. While the queue is stopped, direct runs
    or runs through ->run_work will not run the queue.

    If we combine the handlers, then we need to handle two things:

    1) If a delayed/stopped run is scheduled, then we should not run
    the queue before that has been completed.
    2) If a queue is delayed/stopped, the handler needs to restart
    the queue. Normally a run of a queue with the stopped bit set
    would be a no-op.

    Case 1 is handled by modifying a currently pending queue run
    to the deadline set by the caller of blk_mq_delay_queue().
    Subsequent attempts to queue a queue run will find the work
    item already pending, and direct runs will see a stopped queue
    as before.

    Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
    that tells the work handler that it should clear a stopped
    queue and run the handler.
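
    The two cases above can be illustrated with a minimal userspace model.
    It is a sketch under stated assumptions, not the blk-mq implementation:
    struct hctx_model and every function name below are hypothetical, time
    is a plain counter, and the single start_on_run flag stands in for
    BLK_MQ_S_START_ON_RUN.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of one hardware queue; all names are illustrative. */
struct hctx_model {
    bool stopped;            /* models the stopped state bit */
    bool start_on_run;       /* models BLK_MQ_S_START_ON_RUN */
    bool work_pending;       /* a run_work item is queued */
    unsigned long run_at;    /* deadline of the pending work item */
    unsigned int dispatched; /* how many times the queue actually ran */
};

/* A direct run: a no-op while the queue is stopped. */
static void run_queue_direct(struct hctx_model *h)
{
    if (h->stopped)
        return;
    h->dispatched++;
}

/* Asynchronous run: an already pending work item is left untouched (case 1). */
static void run_queue_async(struct hctx_model *h, unsigned long now)
{
    if (h->work_pending)
        return;
    h->work_pending = true;
    h->run_at = now;
}

/* Delay the queue: stop it and move (or create) the work item's deadline. */
static void delay_queue(struct hctx_model *h, unsigned long now,
                        unsigned long delay)
{
    h->stopped = true;
    h->start_on_run = true;
    h->work_pending = true;
    h->run_at = now + delay;
}

/* Single work handler, invoked once the clock has reached run_at. */
static void run_work_fn(struct hctx_model *h, unsigned long now)
{
    assert(h->work_pending && now >= h->run_at);
    h->work_pending = false;
    if (h->start_on_run) {   /* case 2: restart a delayed queue */
        h->start_on_run = false;
        h->stopped = false;
    }
    run_queue_direct(h);
}
```

    In the model, a delay_queue() call followed by further run_queue_async()
    or run_queue_direct() calls dispatches nothing until run_work_fn() fires
    at the delayed deadline, at which point the queue is restarted and run
    once.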

    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This modifies (or adds, if not currently pending) an existing
    delayed work item.
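
    The difference between "modify or add" and plain queueing can be shown
    with a short userspace model. The struct and both function names are
    hypothetical and only mirror the semantics described above; the return
    convention (true when a pending item was modified) is modelled on the
    workqueue mod_delayed_work_on() helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of a delayed work item; names are illustrative only. */
struct delayed_work_model {
    bool pending;
    unsigned long expires;
};

/*
 * Set the deadline to now + delay whether or not the item is pending.
 * Returns true if a pending item was modified, false if it was newly queued.
 */
static bool mod_delayed_work_model(struct delayed_work_model *w,
                                   unsigned long now, unsigned long delay)
{
    bool was_pending;

    assert(w != NULL);
    was_pending = w->pending;
    w->pending = true;
    w->expires = now + delay;
    return was_pending;
}

/* Contrast: queue only if idle; a pending item keeps its old deadline. */
static bool queue_delayed_work_model(struct delayed_work_model *w,
                                     unsigned long now, unsigned long delay)
{
    assert(w != NULL);
    if (w->pending)
        return false;
    w->pending = true;
    w->expires = now + delay;
    return true;
}
```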

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • They serve the exact same purpose. Get rid of the non-delayed
    work variant, and just run it without delay for the normal case.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Apr, 2017

1 commit

  • At least one driver, mtip32xx, has a hard coded dependency on
    the value of the reserved tag used for internal commands. While
    that should really be fixed up, for now let's ensure that we just
    bypass the scheduler tags for an allocation marked as reserved. They
    are used for house keeping or error handling, so we can safely
    ignore them in the scheduler.
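
    As a rough illustration of the bypass, the choice of tag pool can be
    modelled as below. This is a sketch with hypothetical names (enum
    tag_pool, struct queue_model, pick_tag_pool()), assuming only what the
    message states: a reserved allocation skips the scheduler tags even
    when an I/O scheduler is attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model; the pool names and types are illustrative only. */
enum tag_pool {
    POOL_DRIVER,   /* tags handed to the driver directly */
    POOL_SCHED     /* tags owned by the I/O scheduler */
};

struct queue_model {
    bool has_scheduler;   /* an I/O scheduler is attached to the queue */
};

/* Reserved allocations always come from the driver pool. */
static enum tag_pool pick_tag_pool(const struct queue_model *q, bool reserved)
{
    assert(q != NULL);
    if (!q->has_scheduler || reserved)
        return POOL_DRIVER;
    return POOL_SCHED;
}
```

    The 02 May commit listed earlier removes this bypass again once
    mtip32xx no longer depends on the reserved tag value.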

    Tested-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe