05 May, 2016

1 commit

  • commit b30a337ca27c4f40439e4bfb290cba5f88d73bb7 upstream.

    The initialization of the partition's percpu_ref should be done before
    sending out the KOBJ_ADD uevent, which may cause userspace to read the
    partition table. Otherwise the uninitialized percpu_ref may be accessed
    in the data path.

    This patch fixes this issue reported by Naveen.
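    A minimal sketch of the corrected ordering (the field and callback
    names here are assumptions, not the exact upstream code):

        /* init the ref _before_ userspace can be told about the partition */
        err = percpu_ref_init(&p->ref, partition_release, 0, GFP_KERNEL);
        if (err)
            goto out_free;

        /* only now may userspace react and start reading the partition table */
        kobject_uevent(&p->kobj, KOBJ_ADD);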

    Reported-by: Naveen Kaje
    Tested-by: Naveen Kaje
    Fixes: 6c71013ecb7e2 ("block: partition: convert percpu ref")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

13 Apr, 2016

1 commit

  • commit 6acfe68bac7e6f16dc312157b1fa6e2368985013 upstream.

    Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
    than if an underlying null_blk device were used directly. One of the
    reasons for this drop in performance is that blk_insert_cloned_request()
    was calling blk_mq_insert_request() with @async=true. This forced the
    use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
    which ushered in ping-ponging between process context (fio in this case)
    and kblockd's kworker to submit the cloned request. The ftrace
    function_graph tracer showed:

    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...
    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...

    Fixing blk_insert_cloned_request()'s blk_mq_insert_request() call to
    _not_ use kblockd to submit the cloned requests isn't enough to
    eliminate the observed context switches.
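    A hedged sketch of the blk-core part of the fix (the surrounding call
    site in blk_insert_cloned_request() is assumed, not quoted verbatim):

        if (q->mq_ops) {
            /* at_head=false, run_queue=true, async=false: run the hw
             * queue from the caller's context instead of via kblockd */
            blk_mq_insert_request(rq, false, true, false);
            return 0;
        }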

    In addition to this dm-mq specific blk-core fix, there are 2 DM core
    fixes to dm-mq that (when paired with the blk-core fix) completely
    eliminate the observed context switching:

    1) don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by the desire to reduce the overhead of dm-mq: punting to
    kblockd just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this). So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e16ca2d !

    2) use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request). Using blk_mq_complete_request() for
    blk-mq requests is important for performance. It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.
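    Sketched, the branching that DM now does itself in its completion
    path (error propagation elided; the exact upstream signature of
    blk_mq_complete_request() may differ):

        static void dm_complete_request(struct request *rq, int error)
        {
            if (!rq->q->mq_ops)
                blk_complete_request(rq);
            else
                blk_mq_complete_request(rq);
        }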

    dm-mq fix #2 is _much_ more important than #1 for eliminating the
    context switches.
    Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
    After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

    With these changes multithreaded async read IOPs improved from ~950K
    to ~1350K for this dm-mq stacked on null_blk test-case. The raw read
    IOPs of the underlying null_blk device for the same workload is ~1950K.

    Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
    Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
    Reported-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

10 Mar, 2016

1 commit

  • commit 5f009d3f8e6685fe8c6215082c1696a08b411220 upstream.

    The new queue limit (max_dev_sectors) is not used by the majority of
    block drivers, and should be initialized to 0 so that the driver's
    requested settings are used.
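    As a sketch of what that means in blk_set_default_limits() (the
    neighbouring assignment is illustrative):

        lim->max_sectors = lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
        lim->max_dev_sectors = 0;    /* unset: min_not_zero() ignores it,
                                      * so the driver's setting wins */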

    Signed-off-by: Keith Busch
    Acked-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     

04 Mar, 2016

1 commit

  • commit 2d99b55d378c996b9692a0c93dd25f4ed5d58934 upstream.

    Commit 35dc248383bbab0a7203fca4d722875bc81ef091 introduced a check for
    current->mm to see if we have a user-space context, and only copies data
    if we do. Now if an IO gets interrupted by a signal, data isn't copied
    into user space any more (as we don't have a user-space context), but
    user space isn't notified about it.

    This patch modifies the behaviour to return -EINTR from bio_uncopy_user()
    to notify userland that a signal has interrupted the syscall, otherwise
    it could lead to a situation where the caller may get a buffer with
    no data returned.

    This can be reproduced by issuing SG_IO ioctl()s in one thread while
    constantly sending signals to it.
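    The core of the change in bio_uncopy_user(), sketched (the
    surrounding cleanup path is assumed):

        if (!current->mm) {
            /* a signal killed the task: there is no user-space context
             * to copy into, so report it instead of returning 0 */
            ret = -EINTR;
            goto cleanup;
        }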

    Fixes: 35dc248383bb ("[SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal")
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Hannes Reinecke
     

18 Feb, 2016

2 commits

  • commit d0e5fbb01a67e400e82fefe4896ea40c6447ab98 upstream.

    After commit e36f62042880 ("block: split bios to max possible length"),
    a bio can be split in the middle of a vector entry, which makes it easy
    to split out a bio whose size isn't aligned with the logical block size,
    especially when the block size is bigger than 512.

    This patch fixes the issue by making the max io size aligned
    to logical block size.
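    The fix boils down to rounding the maximum io size down to the
    logical block size, roughly:

        static inline unsigned get_max_io_size(struct request_queue *q,
                                               struct bio *bio)
        {
            unsigned sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
            unsigned mask = queue_logical_block_size(q) >> 9;

            /* align the maximum io size to the logical block size */
            sectors &= ~(mask - 1);
            return sectors;
        }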

    Fixes: e36f62042880 ("block: split bios to max possible length")
    Reported-by: Stefan Haberland
    Cc: Keith Busch
    Suggested-by: Linus Torvalds
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit e36f6204288088fda50d1c84830340ccb70f85ff upstream.

    This splits bio in the middle of a vector to form the largest possible
    bio at the h/w's desired alignment, and guarantees the bio being split
    will have some data.

    The criteria for splitting is changed from the max sectors to the h/w's
    optimal sector alignment if it is provided. For h/w that advertise their
    block storage's underlying chunk size, it's a big performance win to not
    submit commands that cross them. If sector alignment is not provided,
    this patch uses the max sectors as before.
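    The split criterion can be sketched as: stay within max_sectors, but
    never cross an advertised chunk boundary (shape assumed):

        static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                       sector_t offset)
        {
            if (!q->limits.chunk_sectors)
                return q->limits.max_sectors;

            /* sectors left until the next chunk boundary */
            return min(q->limits.max_sectors, (unsigned int)
                       (q->limits.chunk_sectors -
                        (offset & (q->limits.chunk_sectors - 1))));
        }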

    This addresses the performance issue that commit d380561113 attempted to
    fix, but which was reverted due to a splitting logic error.

    Signed-off-by: Keith Busch
    Cc: Jens Axboe
    Cc: Ming Lei
    Cc: Kent Overstreet
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     

09 Jan, 2016

1 commit

  • This reverts commit d3805611130af9b911e908af9f67a3f64f4f0914.

    If we end up splitting on the first segment, we don't adjust the sector
    count. That results in hitting a BUG() when attempting to split 0
    sectors.

    As this is just a performance issue and not a regression since the 4.3
    release, let's just revert this change. That gives us more time to test
    a real fix for 4.5, which would be marked for stable anyway.

    Jens Axboe
     

23 Dec, 2015

3 commits

  • For h/w that advertise their block storage's underlying chunk size, it's
    a big performance win to not submit commands that cross them. This patch
    uses that criteria if it is provided. If it is not provided, this patch
    uses the max sectors as before.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Pull block layer fixes from Jens Axboe:
    "Three small fixes for 4.4 final. Specifically:

    - The segment issue fix from Junichi, where the old IO path does a
    bio limit split before potentially bouncing the pages. We need to
    do that in the right order, to ensure that limitations are met.

    - A NVMe surprise removal IO hang fix from Keith.

    - A use-after-free in null_blk, introduced by a previous patch in
    this series. From Mike Krinkin"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    null_blk: fix use-after-free error
    block: ensure to split after potentially bouncing a bio
    NVMe: IO ending fixes on surprise removal

    Linus Torvalds
     
    blk_queue_bio() does the split and then the bounce, which makes the
    segment counting happen on the pages before bouncing, so it can go
    wrong. Move the split to after the bouncing, like we do for blk-mq;
    that fixes the issue of the bio's segment count being wrong.
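    After the fix the ordering in blk_queue_bio() matches blk-mq,
    roughly:

        blk_queue_bounce(q, &bio);               /* may substitute pages */
        blk_queue_split(q, &bio, q->bio_split);  /* count segments after */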

    Fixes: 54efd50bfd87 ("block: make generic_make_request handle arbitrarily sized bios")
    Cc: stable@vger.kernel.org
    Tested-by: Artem S. Tashkinov
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

13 Dec, 2015

1 commit

  • Pull block layer fixes from Jens Axboe:
    "A set of fixes for the current series. This contains:

    - A bunch of fixes for lightnvm, should be the last round for this
    series. From Matias and Wenwei.

    - A writeback detach inode fix from Ilya, also marked for stable.

    - A block (though it says SCSI) fix for an OOPS in SCSI runtime power
    management.

    - Module init error path fixes for null_blk from Minfei"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    null_blk: Fix error path in module initialization
    lightnvm: do not compile in debugging by default
    lightnvm: prevent gennvm module unload on use
    lightnvm: fix media mgr registration
    lightnvm: replace req queue with nvmdev for lld
    lightnvm: comments on constants
    lightnvm: check mm before use
    lightnvm: refactor spin_unlock in gennvm_get_blk
    lightnvm: put blks when luns configure failed
    lightnvm: use flags in rrpc_get_blk
    block: detach bdev inode from its wb in __blkdev_put()
    SCSI: Fix NULL pointer dereference in runtime PM

    Linus Torvalds
     

07 Dec, 2015

2 commits

  • The following commit which went into mainline through networking tree

    3b13758f51de ("cgroups: Allow dynamically changing net_classid")

    conflicts in net/core/netclassid_cgroup.c with the following pending
    fix in cgroup/for-4.4-fixes.

    1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

    The former separates out update_classid() from cgrp_attach() and
    updates it to walk all fds of all tasks in the target css so that it
    can be used from both migration and config change paths. The latter
    drops @css from cgrp_attach().

    Resolve the conflict by making cgrp_attach() call update_classid()
    with the css from the first task. We can revive @tset walking in
    cgrp_attach() but given that net_cls is v1 only where there always is
    only one target css during migration, this is fine.

    Signed-off-by: Tejun Heo
    Reported-by: Stephen Rothwell
    Cc: Nina Schiff

    Tejun Heo
     
  • Pull SCSI fixes from James Bottomley:
    "This is quite a bumper crop of fixes: three from Arnd correcting
    various build issues in some configurations, a lock recursion in
    qla2xxx. Two potentially exploitable issues in hpsa and mvsas, a
    potential null deref in st, a revert of a bdi registration fix that
    turned out to cause even more problems, a set of fixes to allow people
    who only defined MPT2SAS to still work after the mpt2/mpt3sas merger
    and a couple of fixes for issues turned up by the hyper-v storvsc
    driver"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    mpt3sas: fix Kconfig dependency problem for mpt2sas back compatibility
    Revert "scsi: Fix a bdi reregistration race"
    mpt3sas: Add dummy Kconfig option for backwards compatibility
    Fix a memory leak in scsi_host_dev_release()
    block/sd: Fix device-imposed transfer length limits
    scsi_debug: fix prevent_allow+verify regressions
    MAINTAINERS: Add myself as co-maintainer of the SCSI subsystem.
    sd: Make discard granularity match logical block size when LBPRZ=1
    scsi: hpsa: select CONFIG_SCSI_SAS_ATTR
    scsi: advansys needs ISA dma api for ISA support
    scsi_sysfs: protect against double execution of __scsi_remove_device()
    st: fix potential null pointer dereference.
    scsi: report 'INQUIRY result too short' once per host
    advansys: fix big-endian builds
    qla2xxx: Fix rwlock recursion
    hpsa: logical vs bitwise AND typo
    mvsas: don't allow negative timeouts
    mpt3sas: Fix use sas_is_tlr_enabled API before enabling MPI2_SCSIIO_CONTROL_TLR_ON flag

    Linus Torvalds
     

04 Dec, 2015

1 commit

  • The routines in scsi_pm.c assume that if a runtime-PM callback is
    invoked for a SCSI device, it can only mean that the device's driver
    has asked the block layer to handle the runtime power management (by
    calling blk_pm_runtime_init(), which among other things sets q->dev).

    However, this assumption turns out to be wrong for things like the ses
    driver. Normally ses devices are not allowed to do runtime PM, but
    userspace can override this setting. If this happens, the kernel gets
    a NULL pointer dereference when blk_post_runtime_resume() tries to use
    the uninitialized q->dev pointer.

    This patch fixes the problem by checking q->dev in the block layer
    before handling runtime PM. Since ses doesn't define any PM callbacks
    or call blk_pm_runtime_init(), the crash no longer occurs.
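    The shape of the check added to the blk_*_runtime_*() helpers
    (sketch; the rest of the body is elided):

        void blk_post_runtime_resume(struct request_queue *q, int err)
        {
            if (!q->dev)   /* driver never called blk_pm_runtime_init() */
                return;
            /* ... normal runtime-resume handling ... */
        }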

    This fixes Bugzilla #101371.
    https://bugzilla.kernel.org/show_bug.cgi?id=101371

    More discussion can be found at the link below.
    http://marc.info/?l=linux-scsi&m=144163730531875&w=2

    Signed-off-by: Ken Xue
    Acked-by: Alan Stern
    Cc: Xiangliang Yu
    Cc: James E.J. Bottomley
    Cc: Jens Axboe
    Cc: Michael Terry
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Ken Xue
     

03 Dec, 2015

1 commit

  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                   \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css of
    P1. Now if memory is enabled on P1's subtree_control, memory csses
    should be created on both A and B, and A's processes should be moved to
    the former and B's processes to the latter. IOW, enabling controllers
    can cause atomic migrations into different csses.

    The core cgroup migration logic has been updated accordingly but the
    controller migration methods haven't and still assume that all tasks
    migrate to a single target css; furthermore, the methods were fed the
    css in which subtree_control was updated which is the parent of the
    target csses. pids controller depends on the migration methods to
    move charges and this made the controller attribute charges to the
    wrong csses often triggering the following warning by driving a
    counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the three
    migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
    by updating the cgroup_taskset iteration helpers to also return the
    destination css in addition to the task being migrated. All controllers
    are updated accordingly; see the sketch after the list below.

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's single source
    and destination and thus doesn't support v2 hierarchy already. The
    only change made by this patchset is how that single destination css
    is obtained.

    * memory migration path already doesn't do anything on v2. How the
    single destination css is obtained is updated and the prep stage of
    mem_cgroup_can_attach() is reordered to accommodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.
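    The updated iteration pattern in a controller's migration method
    looks roughly like this (the charge helpers are illustrative):

        struct task_struct *task;
        struct cgroup_subsys_state *dst_css;

        cgroup_taskset_for_each(task, dst_css, tset) {
            /* charge against the css each task actually migrates into,
             * not the css whose subtree_control was changed */
            pids_charge(css_pids(dst_css), 1);
        }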

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

30 Nov, 2015

1 commit

  • When a cloned request is retried on other queues it always needs
    to be checked against the queue limits of that queue.
    Otherwise the calculations for nr_phys_segments might be wrong,
    leading to a crash in scsi_init_sgtable().

    To clarify this, the patch renames blk_rq_check_limits()
    to blk_cloned_rq_check_limits() and removes the symbol
    export, as the new function should only be used for
    cloned requests and never exported.

    Cc: Mike Snitzer
    Cc: Ewan Milne
    Cc: Jeff Moyer
    Signed-off-by: Hannes Reinecke
    Fixes: e2a60da74 ("block: Clean up special command handling logic")
    Cc: stable@vger.kernel.org # 3.7+
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

26 Nov, 2015

3 commits

  • Today, blockdev --rereadpt /dev/sda will fail with EBUSY if any
    partition of sda is mounted (and will fail with EINVAL if pointed
    at a partition). But it will pass if the entire block device is
    formatted with a filesystem and mounted. I don't think this makes
    sense; partitioning should surely not ever change out from under
    a mounted device.

    So check for bdev->bd_super, and fail that with -EBUSY as well.
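    Sketched, the extra check in blkdev_reread_part() (exact context
    assumed):

        if (bdev->bd_part_count || bdev->bd_super)
            return -EBUSY;  /* partitions in use, or device mounted */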

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Eric Sandeen
     
  • Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests")
    had the unfortunate side-effect of removing an implicit clamp to
    BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
    code. This caused problems for some SMR drives.

    Debugging this issue revealed a few problems with the existing
    infrastructure since the block layer didn't know how to deal with
    device-imposed limits, only limits set by the I/O controller.

    - Introduce a new queue limit, max_dev_sectors, which is used by the
    ULD to signal the maximum sectors for a REQ_TYPE_FS request.

    - Ensure that max_dev_sectors is correctly stacked and taken into
    account when overriding max_sectors through sysfs (see the sketch
    after this list).

    - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
    values for later processing.

    - In sd_revalidate() set the queue's max_dev_sectors based on the
    MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
    is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
    field size.

    - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
    VPD--if reported and sane--to signal the preferred device transfer
    size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

    - blk_limits_max_hw_sectors() is no longer used and can be removed.
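    The stacking mentioned above can be sketched as it might appear in
    blk_queue_max_hw_sectors() (details assumed):

        limits->max_hw_sectors = max_hw_sectors;
        max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
        max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
        limits->max_sectors = max_sectors;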

    Signed-off-by: Martin K. Petersen
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
    Reviewed-by: Christoph Hellwig
    Tested-by: sweeneygj@gmx.com
    Tested-by: Arzeets
    Tested-by: David Eisner
    Tested-by: Mario Kicherer
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
     
  • This reverts commit 1b2ff19e6a957b1ef0f365ad331b608af80e932e.

    Jan writes:

    --

    Thanks for report! After some investigation I found out we allocate
    elevator specific data in __get_request() only for non-flush requests. And
    this is actually required since the flush machinery uses the space in
    struct request for something else. Doh. So my patch is just wrong and not
    easy to fix since at the time __get_request() is called we are not sure
    whether the flush machinery will be used in the end. Jens, please revert
    1b2ff19e6a957b1ef0f365ad331b608af80e932e. Thanks!

    I'm somewhat surprised that you can reliably hit the race where flushing
    gets disabled for the device just while the request is in flight. But I
    guess during boot it makes some sense.

    --

    So let's just revert it, we can fix the queue run manually after the
    fact. This race is rare enough that it didn't trigger in testing, it
    requires the specific disable-while-in-flight scenario to trigger.

    Jens Axboe
     

25 Nov, 2015

2 commits

  • We only added the request to the request list for the !blk-mq case,
    so we should only delete it in that case as well.
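    The shape of the fix (context assumed):

        if (!q->mq_ops)
            /* only the legacy path ever added the request to the list */
            list_del_init(&rq->queuelist);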

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pull block layer fixes from Jens Axboe:
    "A round of fixes/updates for the current series.

    This looks a little bigger than it is, but that's mainly because we
    pushed the lightnvm enabled null_blk change out of the merge window so
    it could be updated a bit. The rest of the volume is also mostly
    lightnvm. In particular:

    - Lightnvm. Various fixes, additions, updates from Matias and
    Javier, as well as from Wenwei Tao.

    - NVMe:
    - Fix for potential arithmetic overflow from Keith.
    - Also from Keith, ensure that we reap pending completions from
    a completion queue before deleting it. Fixes kernel crashes
    when resetting a device with IO pending.
    - Various little lightnvm related tweaks from Matias.

    - Fixup flushes to go through the IO scheduler, for the cases where a
    flush is not required. Fixes a case in CFQ where we would be
    idling and not see this request, hence not break the idling. From
    Jan Kara.

    - Use list_{first,prev,next} in elevator.c for cleaner code. From
    Geliang Tang.

    - Fix for a warning trigger on btrfs and raid on single queue blk-mq
    devices, where we would flush plug callbacks with preemption
    disabled. From me.

    - A mac partition validation fix from Kees Cook.

    - Two merge fixes from Ming, marked stable. A third part is adding a
    new warning so we'll notice this quicker in the future, if we screw
    up the accounting.

    - Cleanup of thread name/creation in mtip32xx from Rasmus Villemoes"

    * 'for-linus' of git://git.kernel.dk/linux-block: (32 commits)
    blk-merge: warn if figured out segment number is bigger than nr_phys_segments
    blk-merge: fix blk_bio_segment_split
    block: fix segment split
    blk-mq: fix calling unplug callbacks with preempt disabled
    mac: validate mac_partition is within sector
    mtip32xx: use formatting capability of kthread_create_on_node
    NVMe: reap completion entries when deleting queue
    lightnvm: add free and bad lun info to show luns
    lightnvm: keep track of block counts
    nvme: lightnvm: use admin queues for admin cmds
    lightnvm: missing free on init error
    lightnvm: wrong return value and redundant free
    null_blk: do not del gendisk with lightnvm
    null_blk: use device addressing mode
    null_blk: use ppa_cache pool
    NVMe: Fix possible arithmetic overflow for max segments
    blk-flush: Queue through IO scheduler when flush not required
    null_blk: register as a LightNVM device
    elevator: use list_{first,prev,next}_entry
    lightnvm: cleanup queue before target removal
    ...

    Linus Torvalds
     

24 Nov, 2015

3 commits

    We have seen lots of reports of this kind of issue, so add a warning
    in blk-merge so that it can be triggered easily, instead of depending
    on a warning/BUG from drivers.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    Commit bdced438acd83a ("block: setup bi_phys_segments after splitting")
    introduced the computation of bio->bi_phys_segments during bio
    splitting.

    Unfortunately neither bio->bi_seg_front_size nor bio->bi_seg_back_size
    is computed, so too many physical segments may be obtained for one
    request, since both values are used to check whether one segment can
    span two bios.

    This patch fixes the issue by computing the two variables in
    blk_bio_segment_split().

    Fixes: bdced438acd83a ("block: setup bi_phys_segments after splitting")
    Reported-by: Michael Ellerman
    Reported-by: Mark Salter
    Tested-by: Laurent Dufour
    Tested-by: Mark Salter
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    Inside blk_bio_segment_split(), the previous-bvec pointer (bvprvp)
    always points to the iterator-local variable, which is obviously wrong,
    so fix it by pointing it at the local variable 'bvprv'.
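    The two-line shape of the fix inside the bvec loop:

        bvprv = bv;
        bvprvp = &bvprv;   /* was &bv, the iterator-local copy that gets
                            * overwritten on the next loop iteration */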

    Fixes: 5014c311baa2b ("block: fix bogus compiler warnings in blk-merge.c")
    Cc: stable@kernel.org #4.3
    Reported-by: Michael Ellerman
    Reported-by: Mark Salter
    Tested-by: Laurent Dufour
    Tested-by: Mark Salter
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

21 Nov, 2015

1 commit

  • Liu reported that running certain parts of xfstests threw the
    following error:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
    in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
    3 locks held by kworker/u16:0/6:
    #0: ("writeback"){++++.+}, at: [] process_one_work+0x173/0x730
    #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [] process_one_work+0x173/0x730
    #2: (&type->s_umount_key#44){+++++.}, at: [] trylock_super+0x25/0x60
    CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
    Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
    Workqueue: writeback wb_workfn (flush-btrfs-108)
    ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
    0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
    ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
    Call Trace:
    [] dump_stack+0x4f/0x74
    [] ___might_sleep+0x185/0x240
    [] __might_sleep+0x52/0x90
    [] __alloc_pages_nodemask+0x268/0x410
    [] ? sched_clock_local+0x1c/0x90
    [] ? local_clock+0x21/0x40
    [] ? __lock_release+0x420/0x510
    [] ? __lock_acquired+0x16c/0x3c0
    [] alloc_pages_current+0xc5/0x210
    [] ? rbio_is_full+0x55/0x70 [btrfs]
    [] ? mark_held_locks+0x78/0xa0
    [] ? _raw_spin_unlock_irqrestore+0x40/0x60
    [] full_stripe_write+0x5a/0xc0 [btrfs]
    [] __raid56_parity_write+0x39/0x60 [btrfs]
    [] run_plug+0x11b/0x140 [btrfs]
    [] btrfs_raid_unplug+0x23/0x70 [btrfs]
    [] blk_flush_plug_list+0x82/0x1f0
    [] blk_sq_make_request+0x1f9/0x740
    [] ? generic_make_request_checks+0x222/0x7c0
    [] ? blk_queue_enter+0x124/0x310
    [] ? blk_queue_enter+0x92/0x310
    [] generic_make_request+0x172/0x2c0
    [] ? generic_make_request+0x164/0x2c0
    [] submit_bio+0x70/0x140
    [] ? rbio_add_io_page+0x99/0x150 [btrfs]
    [] finish_rmw+0x4d9/0x600 [btrfs]
    [] full_stripe_write+0x9c/0xc0 [btrfs]
    [] raid56_parity_write+0xef/0x160 [btrfs]
    [] btrfs_map_bio+0xe3/0x2d0 [btrfs]
    [] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
    [] submit_one_bio+0x74/0xb0 [btrfs]
    [] submit_extent_page+0xe5/0x1c0 [btrfs]
    [] __extent_writepage_io+0x408/0x4c0 [btrfs]
    [] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
    [] __extent_writepage+0x218/0x3a0 [btrfs]
    [] ? mark_held_locks+0x78/0xa0
    [] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
    [] extent_writepages+0x52/0x70 [btrfs]
    [] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
    [] btrfs_writepages+0x27/0x30 [btrfs]
    [] do_writepages+0x23/0x40
    [] __writeback_single_inode+0x89/0x4d0
    [] ? writeback_sb_inodes+0x260/0x480
    [] ? writeback_sb_inodes+0x260/0x480
    [] ? writeback_sb_inodes+0x15f/0x480
    [] writeback_sb_inodes+0x2d2/0x480
    [] ? down_read_trylock+0x57/0x60
    [] ? trylock_super+0x25/0x60
    [] ? rcu_read_lock_sched_held+0x4f/0x90
    [] __writeback_inodes_wb+0x8c/0xc0
    [] wb_writeback+0x2b5/0x500
    [] ? mark_held_locks+0x78/0xa0
    [] ? __local_bh_enable_ip+0x68/0xc0
    [] ? wb_do_writeback+0x62/0x310
    [] wb_do_writeback+0xc1/0x310
    [] ? set_worker_desc+0x79/0x90
    [] wb_workfn+0x92/0x330
    [] process_one_work+0x223/0x730
    [] ? process_one_work+0x173/0x730
    [] ? worker_thread+0x18f/0x430
    [] worker_thread+0x11d/0x430
    [] ? maybe_create_worker+0xf0/0xf0
    [] ? maybe_create_worker+0xf0/0xf0
    [] kthread+0xef/0x110
    [] ? schedule_tail+0x1e/0xd0
    [] ? __init_kthread_worker+0x70/0x70
    [] ret_from_fork+0x3f/0x70
    [] ? __init_kthread_worker+0x70/0x70

    The issue is that we've got the software context pinned while
    calling blk_flush_plug_list(), which flushes callbacks that
    are allowed to sleep. btrfs and raid have such callbacks.

    Flip the checks around a bit, so we can enable preempt a bit
    earlier and flush plugs without having preempt disabled.

    This only affects blk-mq driven devices, and only those that
    register a single queue.
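    Sketched against blk_sq_make_request() (the call-site details are
    assumptions):

        /* drop the software-queue context first: this re-enables
         * preemption, so plug callbacks are then free to sleep */
        blk_mq_put_ctx(data.ctx);
        blk_flush_plug_list(plug, false);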

    Reported-by: Liu Bo
    Tested-by: Liu Bo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Nov, 2015

2 commits

  • If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single
    512 byte sector would be read (secsize / 512). However the partition
    structure would be located past the end of the buffer (secsize % 512).
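    The validation added by the fix, roughly (exact code assumed):

        secsize = be16_to_cpu(md->block_size);
        datasize = round_down(secsize, 512);
        data = read_part_sector(state, datasize / 512, &sect);
        if (!data)
            return -1;
        partoffset = secsize % 512;
        if (partoffset + sizeof(*part) > datasize)
            return -1;      /* partition struct would overrun the buffer */
        part = (struct mac_partition *)(data + partoffset);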

    Signed-off-by: Kees Cook
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Kees Cook
     
  • Fix use after free crashes like the following:

    general protection fault: 0000 [#1] SMP
    Call Trace:
    [] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
    [] pmem_rw_page+0x42/0x80 [nd_pmem]
    [] bdev_read_page+0x50/0x60
    [] do_mpage_readpage+0x510/0x770
    [] ? I_BDEV+0x20/0x20
    [] ? lru_cache_add+0x1c/0x50
    [] mpage_readpages+0x107/0x170
    [] ? I_BDEV+0x20/0x20
    [] ? I_BDEV+0x20/0x20
    [] blkdev_readpages+0x1d/0x20
    [] __do_page_cache_readahead+0x28f/0x310
    [] ? __do_page_cache_readahead+0x169/0x310
    [] ? pagecache_get_page+0x2d/0x1d0
    [] filemap_fault+0x396/0x530
    [] __do_fault+0x4e/0xf0
    [] handle_mm_fault+0x11bd/0x1b50

    Cc:
    Cc: Jens Axboe
    Cc: Alexander Viro
    Reported-by: kbuild test robot
    Acked-by: Matthew Wilcox
    [willy: symmetry fixups]
    Signed-off-by: Dan Williams

    Dan Williams
     

17 Nov, 2015

2 commits

    Currently blk_insert_flush() just adds the flush request to
    q->queue_head when no flush is required. That completely bypasses the
    IO scheduler, so e.g. CFQ can be idling waiting for a new request to
    arrive and will idle through the whole window unnecessarily. Luckily
    this only happens in rare cases, as usually the checks in
    generic_make_request_checks() clear the FLUSH and FUA flags early if
    they are not needed.

    When no flushing is actually required, we can easily fix the problem by
    properly queueing the request through the IO scheduler. Ideally the IO
    scheduler should also be made aware of requests queued via
    blk_flush_queue_rq(). However, inserting a flush request through the IO
    scheduler can have unwanted side-effects, since due to flush batching,
    delaying the flush request in the IO scheduler will delay all flush
    requests possibly coming from other processes. So we keep adding those
    requests directly to q->queue_head.

    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • To make the intention clearer, use list_{first,prev,next}_entry
    instead of list_entry.
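    For example:

        /* before */
        rq = list_entry(q->queue_head.next, struct request, queuelist);

        /* after: the intention is explicit */
        rq = list_first_entry(&q->queue_head, struct request, queuelist);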

    Signed-off-by: Geliang Tang
    Signed-off-by: Jens Axboe

    Geliang Tang
     

11 Nov, 2015

1 commit

  • Pull block IO poll support from Jens Axboe:
    "Various groups have been doing experimentation around IO polling for
    (really) fast devices. The code has been reviewed and has been
    sitting on the side for a few releases, but this is now good enough
    for coordinated benchmarking and further experimentation.

    Currently O_DIRECT sync read/write are supported. A framework is in
    the works that allows scalable stats tracking so we can auto-tune
    this. And we'll add libaio support as well soon. For now, it's an
    opt-in feature for test purposes"

    * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
    direct-io: be sure to assign dio->bio_bdev for both paths
    directio: add block polling support
    NVMe: add blk polling support
    block: add block polling support
    blk-mq: return tag/queue combo in the make_request_fn handlers
    block: change ->make_request_fn() and users to return a queue cookie

    Linus Torvalds
     

08 Nov, 2015

3 commits

  • Add basic support for polling for specific IO to complete. This uses
    the cookie that blk-mq passes back, which enables the block layer
    to pass this cookie to the driver to spin for a specific request.

    This will be combined with request latency tracking, so we can make
    qualified decisions about when to poll and when not to. For now, for
    benchmark purposes, we add a sysfs file that controls whether polling
    is enabled or not.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     
  • Return a cookie, blk_qc_t, from the blk-mq make request functions, that
    allows a later caller to uniquely identify a specific IO. The cookie
    doesn't mean anything to the caller, but the caller can use it to later
    pass back to the block layer. The block layer can then identify the
    hardware queue and request from that cookie.
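    The cookie encoding can be sketched as (roughly as in blk_types.h):

        #define BLK_QC_T_NONE   -1U
        #define BLK_QC_T_SHIFT  16

        static inline blk_qc_t blk_tag_to_qc_t(unsigned int tag,
                                               unsigned int queue_num)
        {
            return tag | (queue_num << BLK_QC_T_SHIFT);
        }

        static inline unsigned int blk_qc_t_to_queue_num(blk_qc_t cookie)
        {
            return cookie >> BLK_QC_T_SHIFT;
        }

        static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
        {
            return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
        }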

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     
  • No functional changes in this patch, but it prepares us for returning
    a more useful cookie related to the IO that was queued up.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     

07 Nov, 2015

2 commits

  • setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
    the current pid namespace. This is in contrast to both the other modes of
    setpriority and the example of kill(-1). Fix this. getpriority and
    ioprio have the same failure mode, fix them too.

    Eric said:

    : After some more thinking about it this patch sounds justifiable.
    :
    : My goal with namespaces is not to build perfect isolation mechanisms
    : as that can get into ill defined territory, but to build well defined
    : mechanisms. And to handle the corner cases so you can use only
    : a single namespace with well defined results.
    :
    : In this case you have found the two interfaces I am aware of that
    : identify processes by uid instead of by pid. Which quite frankly is
    : weird. Unfortunately the weird unexpected cases are hard to handle
    : in the usual way.
    :
    : I was hoping for a little more information. Changes like this one we
    : have to be careful of because someone might be depending on the current
    : behavior. I don't think they are and I do think this make sense as part
    : of the pid namespace.
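    The PRIO_USER loop after the fix, sketched: task_pid_vnr() returns 0
    for tasks outside the caller's pid namespace, so they are skipped.

        do_each_thread(g, p) {
            if (uid_eq(task_uid(p), uid) && task_pid_vnr(p))
                error = set_one_prio(p, niceval, error);
        } while_each_thread(g, p);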

    Signed-off-by: Ben Segall
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Ambrose Feinstein
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Segall
     
  • __GFP_WAIT was used to signal that the caller was in atomic context and
    could not sleep. Now it is possible to distinguish between true atomic
    context and callers that are not willing to sleep. The latter should
    clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
    __GFP_WAIT behaves differently, there is a risk that people will clear the
    wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
    indicate what it does -- setting it allows all reclaim activity;
    clearing it prevents it.
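    After the rename, roughly as in gfp.h:

        /* allow both direct reclaim and waking kswapd */
        #define __GFP_RECLAIM \
            ((__force gfp_t)(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM))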

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman