20 Jan, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "We don't have a lot of core changes this time around, it's mostly in
    drivers, which will come in a subsequent pull.

    The core changes include:

    - blk-mq
    - Prep patch from Christoph, changing blk_mq_alloc_request() to
    take flags instead of just using gfp_t for sleep/nosleep.
    - Doc patch from me, clarifying the difference between legacy
    and blk-mq for timer usage.
    - Fixes from Raghavendra for memory-less numa nodes, and a reuse
    of CPU masks.

    - Cleanup from Geliang Tang, using offset_in_page() instead of open
    coding it.

    - From Ilya, a rename of the request_queue slab cache so its name
    reflects what it holds, and a fix for proper use of bdgrab/put.

    - A real fix for the split across stripe boundaries from Keith. We
    yanked a broken version of this from 4.4-rc final, this one works.

    - From Mike Krinkin, emit a trace message when we split.

    - From Wei Tang, two small cleanups, not explicitly clearing memory
    that is already cleared"
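The gfp_t-to-flags prep change can be sketched in user space as follows. This is a hedged illustration only: the flag name and the waiting logic below are stand-ins for the idea (callers pass explicit behavior flags the allocator no longer has to decode out of a gfp_t), not the kernel's actual BLK_MQ_REQ_* API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag, not a kernel constant. */
enum { REQ_SKETCH_NOWAIT = 1 << 0 };

struct req_sketch { int tag; };

/* With explicit flags, the allocator decides whether to block without
 * inspecting a gfp_t. Purely a sketch of the interface shape. */
static struct req_sketch *alloc_request_sketch(unsigned int flags,
                                               int tags_available,
                                               struct req_sketch *slot)
{
    if (!tags_available) {
        if (flags & REQ_SKETCH_NOWAIT)
            return NULL;        /* caller asked not to block */
        /* a real allocator would sleep here until a tag frees up;
         * the sketch just gives up */
        return NULL;
    }
    slot->tag = 1;
    return slot;
}
```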

    * 'for-4.5/core' of git://git.kernel.dk/linux-block:
    block: use bd{grab,put}() instead of open-coding
    block: split bios to max possible length
    block: add call to split trace point
    blk-mq: Avoid memoryless numa node encoded in hctx numa_node
    blk-mq: Reuse hardware context cpumask for tags
    blk-mq: add a flags parameter to blk_mq_alloc_request
    Revert "blk-flush: Queue through IO scheduler when flush not required"
    block: clarify blk_add_timer() use case for blk-mq
    bio: use offset_in_page macro
    block: do not initialise statics to 0 or NULL
    block: do not initialise globals to 0 or NULL
    block: rename request_queue slab cache

    Linus Torvalds
     

14 Jan, 2016

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has appeared in -next and independently received a
    build success notification from the kbuild robot. The
    'for-4.5/block-dax' topic branch was rebased over the weekend to drop
    the "block device end-of-life" rework that Al would like to see
    re-implemented with a notifier, and to address bug reports against the
    badblocks integration.

    There is pending feedback against "libnvdimm: Add a poison list and
    export badblocks" received last week. Linda identified some localized
    fixups that we will handle incrementally.

    Summary:

    - Media error handling: The 'badblocks' implementation that
    originated in md-raid is up-levelled to a generic capability of a
    block device. This initial implementation is limited to being
    consulted in the pmem block-i/o path. Later, 'badblocks' will be
    consulted when creating dax mappings.

    - Raw block device dax: For virtualization and other cases that want
    large contiguous mappings of persistent memory, add the capability
    to dax-mmap a block device directly.

    - Increased /dev/mem restrictions: Add an option to treat all
    io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access
    while a driver is actively using an address range. This behavior
    is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be
    overridden by the existing "iomem=relaxed" kernel command line
    option.

    - Miscellaneous fixes include a 'pfn'-device huge page alignment fix,
    block device shutdown crash fix, and other small libnvdimm fixes"

    * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits)
    block: kill disk_{check|set|clear|alloc}_badblocks
    libnvdimm, pmem: nvdimm_read_bytes() badblocks support
    pmem, dax: disable dax in the presence of bad blocks
    pmem: fail io-requests to known bad blocks
    libnvdimm: convert to statically allocated badblocks
    libnvdimm: don't fail init for full badblocks list
    block, badblocks: introduce devm_init_badblocks
    block: clarify badblocks lifetime
    badblocks: rename badblocks_free to badblocks_exit
    libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
    libnvdimm: Add a poison list and export badblocks
    nfit_test: Enable DSMs for all test NFITs
    md: convert to use the generic badblocks code
    block: Add badblock management for gendisks
    badblocks: Add core badblock management code
    block: fix del_gendisk() vs blkdev_ioctl crash
    block: enable dax for raw block devices
    block: introduce bdev_file_inode()
    restrict /dev/mem to idle io memory ranges
    arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug
    ...

    Linus Torvalds
     

13 Jan, 2016

1 commit

  • This splits a bio in the middle of a vector to form the largest possible
    bio at the h/w's desired alignment, and guarantees the bio being split
    will have some data.

    The criterion for splitting is changed from the max sectors to the h/w's
    optimal sector alignment, if it is provided. For h/w that advertises its
    block storage's underlying chunk size, it's a big performance win to not
    submit commands that cross them. If sector alignment is not provided,
    this patch uses the max sectors as before.

    This addresses the performance issue commit d380561113 attempted to
    fix, but which was reverted due to a splitting logic error.
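The boundary math behind this split can be sketched in user space as follows. The helper name and its power-of-two assumption are illustrative, not the kernel's actual code: given a starting sector and the h/w chunk size in sectors, the largest bio that avoids crossing a chunk boundary runs to the next boundary, and the result is never zero, so the split bio always carries some data.

```c
#include <assert.h>

/* chunk_sectors is the h/w's advertised chunk size in sectors and is
 * assumed to be a power of two. Returns how many sectors an I/O
 * starting at start_sector can carry without crossing a chunk
 * boundary; never returns 0. */
static unsigned long sectors_to_chunk_boundary(unsigned long start_sector,
                                               unsigned long chunk_sectors)
{
    /* offset of the start within its chunk */
    unsigned long offset = start_sector & (chunk_sectors - 1);

    return chunk_sectors - offset;
}
```

For example, with 128-sector chunks, an I/O starting at sector 100 may span at most 28 sectors before crossing into the next chunk, while one starting exactly on a boundary may span a full chunk.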

    Signed-off-by: Keith Busch
    Cc: Jens Axboe
    Cc: Ming Lei
    Cc: Kent Overstreet
    Cc: # 4.4.x-
    Signed-off-by: Jens Axboe

    Keith Busch
     

10 Jan, 2016

6 commits

  • These actions are completely managed by a block driver or can use the
    badblocks api directly.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Longer term, teach dax to punch "error" holes in mapping requests and
    deliver SIGBUS to applications that consume a bad pmem page. For now,
    simply disable the dax performance optimization in the presence of known
    errors.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Provide a devres interface for initializing a badblocks instance. The
    pmem driver has several scenarios where it will be beneficial to have
    this structure automatically freed when the device is disabled / fails
    probe.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • The badblocks list attached to a gendisk is allocated by the driver
    which equates to the driver owning the lifetime of the object. Do not
    automatically free it in del_gendisk(). This is in preparation for
    expanding the use of badblocks in libnvdimm drivers and introducing
    devm_init_badblocks().

    Signed-off-by: Dan Williams

    Dan Williams
     
  • For symmetry with badblocks_init() make it clear that this path only
    destroys incremental allocations of a badblocks instance, and does not
    free the badblocks instance itself.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • NVDIMM devices, which can behave more like DRAM than block
    devices, may develop bad cache lines, or 'poison'. A block device
    exposed by the pmem driver can then consume poison via a read (or
    write) and cause a machine check. On platforms without machine
    check recovery features, this would mean a crash.

    Having the block device maintain a runtime list of all known sectors
    that have poison directly avoids this, and also provides a path
    forward to enable proper handling/recovery for DAX faults on such a
    device.

    Use the new badblock management interfaces to add a badblocks list to
    gendisks.

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

09 Jan, 2016

4 commits

  • Take the core badblocks implementation from md, and make it generally
    available. This follows the same style as kernel implementations of
    linked lists, rb-trees etc, where you can have a structure that can be
    embedded anywhere, and accessor functions to manipulate the data.

    The only changes in this copy of the code are ones to generalize
    function/variable names from md-specific ones. Also add init and free
    functions.
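The embeddable-structure-plus-accessors style described above can be sketched in user space like this. All names here (simple_badblocks and friends) are hypothetical, and the real kernel code keeps a compact sorted array with merging rather than this naive fixed list; the point is only the shape: a struct you can embed anywhere, with init/set/check accessors.

```c
#include <assert.h>

/* Illustrative only -- not the kernel badblocks API. */
struct bad_range {
    unsigned long long sector;
    unsigned int len;
};

struct simple_badblocks {
    struct bad_range ranges[16];
    int count;
};

static void simple_badblocks_init(struct simple_badblocks *bb)
{
    bb->count = 0;
}

/* Record [sector, sector+len) as bad. Returns 0 on success. */
static int simple_badblocks_set(struct simple_badblocks *bb,
                                unsigned long long sector, unsigned int len)
{
    if (bb->count >= 16)
        return -1;
    bb->ranges[bb->count].sector = sector;
    bb->ranges[bb->count].len = len;
    bb->count++;
    return 0;
}

/* Returns nonzero if [sector, sector+len) overlaps any recorded range. */
static int simple_badblocks_check(const struct simple_badblocks *bb,
                                  unsigned long long sector, unsigned int len)
{
    for (int i = 0; i < bb->count; i++) {
        unsigned long long s = bb->ranges[i].sector;

        if (sector < s + bb->ranges[i].len && s < sector + len)
            return 1;
    }
    return 0;
}

/* The structure embeds in a larger object, as the text describes. */
struct gendisk_like {
    struct simple_badblocks bb;
};
```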

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • When tearing down a block device early in its lifetime, userspace may
    still be performing discovery actions like blkdev_ioctl() to re-read
    partitions.

    The nvdimm_revalidate_disk() implementation depends on
    disk->driverfs_dev to be valid at entry. However, it is set to NULL in
    del_gendisk() and fatally this is happening *before* the disk device is
    deleted from userspace view.

    There's no reason for del_gendisk() to clear ->driverfs_dev. That
    device is the parent of the disk. It is guaranteed to not be freed
    until the disk, as a child, drops its ->parent reference.

    We could also fix this issue locally in nvdimm_revalidate_disk() by
    using disk_to_dev(disk)->parent, but let's fix it globally since
    ->driverfs_dev follows the lifetime of the parent. Longer term we
    should probably just add a @parent parameter to add_disk(), and stop
    carrying this pointer in the gendisk.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
    CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G O 4.4.0-rc5 #2257
    [..]
    Call Trace:
    [] rescan_partitions+0x87/0x2c0
    [] ? __lock_is_held+0x49/0x70
    [] __blkdev_reread_part+0x72/0xb0
    [] blkdev_reread_part+0x25/0x40
    [] blkdev_ioctl+0x4fd/0x9c0
    [] ? current_kernel_time64+0x69/0xd0
    [] block_ioctl+0x3d/0x50
    [] do_vfs_ioctl+0x308/0x560
    [] ? __audit_syscall_entry+0xb1/0x100
    [] ? do_audit_syscall_entry+0x66/0x70
    [] SyS_ioctl+0x79/0x90
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    Reported-by: Robert Hu
    Signed-off-by: Dan Williams

    Dan Williams
     
  • If an application wants exclusive access to all of the persistent memory
    provided by an NVDIMM namespace it can use this raw-block-dax facility
    to forgo establishing a filesystem. This capability is targeted
    primarily to hypervisors wanting to provision persistent memory for
    guests. It can be disabled / enabled dynamically via the new BLKDAXSET
    ioctl.

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     
  • This reverts commit d3805611130af9b911e908af9f67a3f64f4f0914.

    If we end up splitting on the first segment, we don't adjust
    the sector count. That results in hitting a BUG() with attempting
    to split 0 sectors.

    As this is just a performance issue and not a regression since the
    4.3 release, let's just revert this change. That gives us more
    time to test a real fix for 4.5, which would be marked for
    stable anyway.

    Jens Axboe
     

29 Dec, 2015

1 commit


23 Dec, 2015

3 commits

  • For h/w that advertise their block storage's underlying chunk size, it's
    a big performance win to not submit commands that cross them. This patch
    uses that criterion if it is provided. If it is not provided, this patch
    uses the max sectors as before.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Pull block layer fixes from Jens Axboe:
    "Three small fixes for 4.4 final. Specifically:

    - The segment issue fix from Junichi, where the old IO path does a
    bio limit split before potentially bouncing the pages. We need to
    do that in the right order, to ensure that limitations are met.

    - A NVMe surprise removal IO hang fix from Keith.

    - A use-after-free in null_blk, introduced by a previous patch in
    this series. From Mike Krinkin"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    null_blk: fix use-after-free error
    block: ensure to split after potentially bouncing a bio
    NVMe: IO ending fixes on surprise removal

    Linus Torvalds
     
  • blk_queue_bio() does split then bounce, which makes the segment
    counting be based on pages before bouncing, and it can go wrong. Move
    the split to after the bouncing, like we do for blk-mq, and then we
    fix the issue of the bio's segment count being wrong.

    Fixes: 54efd50bfd87 ("block: make generic_make_request handle arbitrarily sized bios")
    Cc: stable@vger.kernel.org
    Tested-by: Artem S. Tashkinov
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

13 Dec, 2015

1 commit

  • Pull block layer fixes from Jens Axboe:
    "A set of fixes for the current series. This contains:

    - A bunch of fixes for lightnvm, should be the last round for this
    series. From Matias and Wenwei.

    - A writeback detach inode fix from Ilya, also marked for stable.

    - A block (though it says SCSI) fix for an OOPS in SCSI runtime power
    management.

    - Module init error path fixes for null_blk from Minfei"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    null_blk: Fix error path in module initialization
    lightnvm: do not compile in debugging by default
    lightnvm: prevent gennvm module unload on use
    lightnvm: fix media mgr registration
    lightnvm: replace req queue with nvmdev for lld
    lightnvm: comments on constants
    lightnvm: check mm before use
    lightnvm: refactor spin_unlock in gennvm_get_blk
    lightnvm: put blks when luns configure failed
    lightnvm: use flags in rrpc_get_blk
    block: detach bdev inode from its wb in __blkdev_put()
    SCSI: Fix NULL pointer dereference in runtime PM

    Linus Torvalds
     

07 Dec, 2015

2 commits

  • The following commit which went into mainline through networking tree

    3b13758f51de ("cgroups: Allow dynamically changing net_classid")

    conflicts in net/core/netclassid_cgroup.c with the following pending
    fix in cgroup/for-4.4-fixes.

    1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

    The former separates out update_classid() from cgrp_attach() and
    updates it to walk all fds of all tasks in the target css so that it
    can be used from both migration and config change paths. The latter
    drops @css from cgrp_attach().

    Resolve the conflict by making cgrp_attach() call update_classid()
    with the css from the first task. We can revive @tset walking in
    cgrp_attach() but given that net_cls is v1 only where there always is
    only one target css during migration, this is fine.

    Signed-off-by: Tejun Heo
    Reported-by: Stephen Rothwell
    Cc: Nina Schiff

    Tejun Heo
     
  • Pull SCSI fixes from James Bottomley:
    "This is quite a bumper crop of fixes: three from Arnd correcting
    various build issues in some configurations, a lock recursion in
    qla2xxx. Two potentially exploitable issues in hpsa and mvsas, a
    potential null deref in st, a revert of a bdi registration fix that
    turned out to cause even more problems, a set of fixes to allow people
    who only defined MPT2SAS to still work after the mpt2/mpt3sas merger
    and a couple of fixes for issues turned up by the hyper-v storvsc
    driver"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    mpt3sas: fix Kconfig dependency problem for mpt2sas back compatibility
    Revert "scsi: Fix a bdi reregistration race"
    mpt3sas: Add dummy Kconfig option for backwards compatibility
    Fix a memory leak in scsi_host_dev_release()
    block/sd: Fix device-imposed transfer length limits
    scsi_debug: fix prevent_allow+verify regressions
    MAINTAINERS: Add myself as co-maintainer of the SCSI subsystem.
    sd: Make discard granularity match logical block size when LBPRZ=1
    scsi: hpsa: select CONFIG_SCSI_SAS_ATTR
    scsi: advansys needs ISA dma api for ISA support
    scsi_sysfs: protect against double execution of __scsi_remove_device()
    st: fix potential null pointer dereference.
    scsi: report 'INQUIRY result too short' once per host
    advansys: fix big-endian builds
    qla2xxx: Fix rwlock recursion
    hpsa: logical vs bitwise AND typo
    mvsas: don't allow negative timeouts
    mpt3sas: Fix use sas_is_tlr_enabled API before enabling MPI2_SCSIIO_CONTROL_TLR_ON flag

    Linus Torvalds
     

04 Dec, 2015

5 commits

  • The routines in scsi_pm.c assume that if a runtime-PM callback is
    invoked for a SCSI device, it can only mean that the device's driver
    has asked the block layer to handle the runtime power management (by
    calling blk_pm_runtime_init(), which among other things sets q->dev).

    However, this assumption turns out to be wrong for things like the ses
    driver. Normally ses devices are not allowed to do runtime PM, but
    userspace can override this setting. If this happens, the kernel gets
    a NULL pointer dereference when blk_post_runtime_resume() tries to use
    the uninitialized q->dev pointer.

    This patch fixes the problem by checking q->dev in the block layer
    before handling runtime PM. Since ses doesn't define any PM callbacks
    or call blk_pm_runtime_init(), the crash won't occur.
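The guard being described can be sketched with made-up types as follows: the block layer's runtime-PM helpers should do nothing when the driver never called blk_pm_runtime_init(), i.e. when q->dev is still NULL. This is an illustration of the check, not the kernel's actual scsi_pm/blk-core code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct device / struct request_queue. */
struct dev_like { int dummy; };
struct queue_like { struct dev_like *dev; };

/* Sketch of a post-runtime-resume helper: bail out early when the
 * queue never opted into block-layer runtime PM (q->dev is NULL). */
static int post_runtime_resume_sketch(struct queue_like *q, int err)
{
    if (!q->dev)
        return err;     /* no block-layer runtime PM for this queue */
    /* ... normal resume bookkeeping would go here ... */
    return 0;
}
```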

    This fixes Bugzilla #101371.
    https://bugzilla.kernel.org/show_bug.cgi?id=101371

    More discussion can be found at the link below.
    http://marc.info/?l=linux-scsi&m=144163730531875&w=2

    Signed-off-by: Ken Xue
    Acked-by: Alan Stern
    Cc: Xiangliang Yu
    Cc: James E.J. Bottomley
    Cc: Jens Axboe
    Cc: Michael Terry
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Ken Xue
     
  • James Bottomley
     
  • There is a split tracepoint that is supposed to be called when a
    bio is split, and it was called in the bio_split function until
    commit 4b1faf931650d4a35b2a ("block: Kill bio_pair_split()").
    But now no one reports splits, so this patch adds a call to
    trace_block_split back in blk_queue_split, right after the split.

    Signed-off-by: Mike Krinkin
    Signed-off-by: Jens Axboe

    Mike Krinkin
     
  • In architectures like powerpc, we can have CPUs without any local
    memory attached (a.k.a. memoryless nodes). In such cases, the
    CPU-to-node mapping can result in the memory allocation hint for block
    hctx->numa_node being populated with node values which do not have
    real memory.

    Instead use local_memory_node(), which is guaranteed to have memory.
    local_memory_node() is a no-op on architectures that do not support
    memoryless nodes.
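A toy user-space model of the fix, with entirely made-up topology tables: local_memory_node_sketch() stands in for the kernel's local_memory_node(), returning the node itself when it has memory and otherwise the nearest node that does, so the allocation hint never names a memoryless node.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4

/* Fabricated topology: CPUs 2 and 3 live on node 1, which has no memory. */
static const int cpu_to_node_tbl[NR_CPUS_SKETCH] = { 0, 0, 1, 1 };
static const int node_has_memory[2]              = { 1, 0 };
static const int nearest_with_memory[2]          = { 0, 0 };

/* Stand-in for local_memory_node(): a node with memory maps to itself,
 * a memoryless node maps to its nearest node that has memory. */
static int local_memory_node_sketch(int node)
{
    return node_has_memory[node] ? node : nearest_with_memory[node];
}

/* What hctx->numa_node would be set to for a given CPU under the fix. */
static int hctx_numa_node_sketch(int cpu)
{
    return local_memory_node_sketch(cpu_to_node_tbl[cpu]);
}
```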

    Signed-off-by: Raghavendra K T
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Raghavendra K T
     
  • hctx->cpumask is already populated; let the tag cpumask follow that
    instead of going through a new for loop.

    Signed-off-by: Raghavendra K T
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Raghavendra K T
     

03 Dec, 2015

1 commit

  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                   \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css of
    P1. Now if memory is enabled on P1's subtree_control, memory csses
    should be created on both A and B, and A's processes should be moved to
    the former and B's processes to the latter. IOW, enabling controllers
    can cause atomic migrations into different csses.

    The core cgroup migration logic has been updated accordingly but the
    controller migration methods haven't and still assume that all tasks
    migrate to a single target css; furthermore, the methods were fed the
    css in which subtree_control was updated which is the parent of the
    target csses. pids controller depends on the migration methods to
    move charges and this made the controller attribute charges to the
    wrong csses often triggering the following warning by driving a
    counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the three
    migration methods, ->can_attach(), ->cancel_attach() and ->attach(),
    and updating the cgroup_taskset iteration helpers to also return the
    destination css in addition to the task being migrated. All controllers
    are updated accordingly.

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's single source
    and destination and thus doesn't support v2 hierarchy already. The
    only change made by this patchset is how that single destination css
    is obtained.

    * memory migration path already doesn't do anything on v2. How the
    single destination css is obtained is updated and the prep stage of
    mem_cgroup_can_attach() is reordered to accommodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

02 Dec, 2015

1 commit


01 Dec, 2015

1 commit


30 Nov, 2015

1 commit

  • When a cloned request is retried on other queues it always needs
    to be checked against the queue limits of that queue.
    Otherwise the calculations for nr_phys_segments might be wrong,
    leading to a crash in scsi_init_sgtable().

    To clarify this the patch renames blk_rq_check_limits()
    to blk_cloned_rq_check_limits() and removes the symbol
    export, as the new function should only be used for
    cloned requests and never exported.

    Cc: Mike Snitzer
    Cc: Ewan Milne
    Cc: Jeff Moyer
    Signed-off-by: Hannes Reinecke
    Fixes: e2a60da74 ("block: Clean up special command handling logic")
    Cc: stable@vger.kernel.org # 3.7+
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

26 Nov, 2015

3 commits

  • Today, blockdev --rereadpt /dev/sda will fail with EBUSY if any
    partition of sda is mounted (and will fail with EINVAL if pointed
    at a partition). But it will pass if the entire block device is
    formatted with a filesystem and mounted. I don't think this makes
    sense; partitioning should surely not ever change out from under
    a mounted device.

    So check for bdev->bd_super, and fail that with -EBUSY as well.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Eric Sandeen
     
  • Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests")
    had the unfortunate side-effect of removing an implicit clamp to
    BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
    code. This caused problems for some SMR drives.

    Debugging this issue revealed a few problems with the existing
    infrastructure since the block layer didn't know how to deal with
    device-imposed limits, only limits set by the I/O controller.

    - Introduce a new queue limit, max_dev_sectors, which is used by the
    ULD to signal the maximum sectors for a REQ_TYPE_FS request.

    - Ensure that max_dev_sectors is correctly stacked and taken into
    account when overriding max_sectors through sysfs.

    - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
    values for later processing.

    - In sd_revalidate() set the queue's max_dev_sectors based on the
    MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
    is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
    field size.

    - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
    VPD--if reported and sane--to signal the preferred device transfer
    size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

    - blk_limits_max_hw_sectors() is no longer used and can be removed.
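The stacking rule described above can be sketched as a min-not-zero clamp: the effective max_sectors is bounded by both the controller limit (max_hw_sectors) and the device-imposed limit (max_dev_sectors), and a sysfs override can only lower it further. The function names and the zero-means-unlimited convention here are illustrative, not the exact block-layer helpers.

```c
#include <assert.h>

/* Smaller of a and b, treating 0 as "no limit". */
static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/* Effective transfer cap in sectors: the controller and device limits
 * both apply, and a user override cannot raise the result above them. */
static unsigned int effective_max_sectors(unsigned int max_hw_sectors,
                                          unsigned int max_dev_sectors,
                                          unsigned int user_max_sectors)
{
    unsigned int cap = min_not_zero_u(max_hw_sectors, max_dev_sectors);

    return min_not_zero_u(cap, user_max_sectors);
}
```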

    Signed-off-by: Martin K. Petersen
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
    Reviewed-by: Christoph Hellwig
    Tested-by: sweeneygj@gmx.com
    Tested-by: Arzeets
    Tested-by: David Eisner
    Tested-by: Mario Kicherer
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
     
  • This reverts commit 1b2ff19e6a957b1ef0f365ad331b608af80e932e.

    Jan writes:

    --

    Thanks for report! After some investigation I found out we allocate
    elevator specific data in __get_request() only for non-flush requests. And
    this is actually required since the flush machinery uses the space in
    struct request for something else. Doh. So my patch is just wrong and not
    easy to fix since at the time __get_request() is called we are not sure
    whether the flush machinery will be used in the end. Jens, please revert
    1b2ff19e6a957b1ef0f365ad331b608af80e932e. Thanks!

    I'm somewhat surprised that you can reliably hit the race where flushing
    gets disabled for the device just while the request is in flight. But I
    guess during boot it makes some sense.

    --

    So let's just revert it, we can fix the queue run manually after the
    fact. This race is rare enough that it didn't trigger in testing, it
    requires the specific disable-while-in-flight scenario to trigger.

    Jens Axboe
     

25 Nov, 2015

7 commits

  • Just a comment update on not needing queue_lock, and that we aren't
    really adding the request to a timeout list for !mq.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Use offset_in_page macro instead of (addr & ~PAGE_MASK).
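A user-space illustration of the equivalence behind this cleanup: for a given page size, offset_in_page(p) is just (p & ~PAGE_MASK), the byte offset of p within its page. PAGE_SIZE is assumed to be 4K here rather than queried from the system.

```c
#include <assert.h>

/* Assumed 4K pages; the kernel derives these from the architecture. */
#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Byte offset of p within its page -- what the open-coded expression
 * (addr & ~PAGE_MASK) computed before the cleanup. */
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
```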

    Signed-off-by: Geliang Tang
    Signed-off-by: Jens Axboe

    Geliang Tang
     
  • This patch fixes the checkpatch.pl error in genhd.c:

    ERROR: do not initialise statics to 0 or NULL

    Signed-off-by: Wei Tang
    Signed-off-by: Jens Axboe

    Wei Tang
     
  • This patch fixes the checkpatch.pl error in blk-exec.c:

    ERROR: do not initialise globals to 0 or NULL

    Signed-off-by: Wei Tang
    Signed-off-by: Jens Axboe

    Wei Tang
     
  • Name the cache after the actual name of the struct.

    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     
  • We only added the request to the request list for the !blk-mq case,
    so we should only delete it in that case as well.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pull block layer fixes from Jens Axboe:
    "A round of fixes/updates for the current series.

    This looks a little bigger than it is, but that's mainly because we
    pushed the lightnvm enabled null_blk change out of the merge window so
    it could be updated a bit. The rest of the volume is also mostly
    lightnvm. In particular:

    - Lightnvm. Various fixes, additions, updates from Matias and
    Javier, as well as from Wenwei Tao.

    - NVMe:
    - Fix for potential arithmetic overflow from Keith.
    - Also from Keith, ensure that we reap pending completions from
    a completion queue before deleting it. Fixes kernel crashes
    when resetting a device with IO pending.
    - Various little lightnvm related tweaks from Matias.

    - Fixup flushes to go through the IO scheduler, for the cases where a
    flush is not required. Fixes a case in CFQ where we would be
    idling and not see this request, hence not break the idling. From
    Jan Kara.

    - Use list_{first,prev,next} in elevator.c for cleaner code. From
    Geliang Tang.

    - Fix for a warning trigger on btrfs and raid on single queue blk-mq
    devices, where we would flush plug callbacks with preemption
    disabled. From me.

    - A mac partition validation fix from Kees Cook.

    - Two merge fixes from Ming, marked stable. A third part is adding a
    new warning so we'll notice this quicker in the future, if we screw
    up the accounting.

    - Cleanup of thread name/creation in mtip32xx from Rasmus Villemoes"

    * 'for-linus' of git://git.kernel.dk/linux-block: (32 commits)
    blk-merge: warn if figured out segment number is bigger than nr_phys_segments
    blk-merge: fix blk_bio_segment_split
    block: fix segment split
    blk-mq: fix calling unplug callbacks with preempt disabled
    mac: validate mac_partition is within sector
    mtip32xx: use formatting capability of kthread_create_on_node
    NVMe: reap completion entries when deleting queue
    lightnvm: add free and bad lun info to show luns
    lightnvm: keep track of block counts
    nvme: lightnvm: use admin queues for admin cmds
    lightnvm: missing free on init error
    lightnvm: wrong return value and redundant free
    null_blk: do not del gendisk with lightnvm
    null_blk: use device addressing mode
    null_blk: use ppa_cache pool
    NVMe: Fix possible arithmetic overflow for max segments
    blk-flush: Queue through IO scheduler when flush not required
    null_blk: register as a LightNVM device
    elevator: use list_{first,prev,next}_entry
    lightnvm: cleanup queue before target removal
    ...

    Linus Torvalds