17 Sep, 2016

5 commits

  • In order to get good cache behavior from a sbitmap, we want each CPU to
    stick to its own cacheline(s) as much as possible. This might happen
    naturally as the bitmap gets filled up and the alloc_hint values spread
    out, but we really want this behavior from the start. blk-mq apparently
    intended to do this, but the code to do this was never wired up. Get rid
    of the dead code and make it part of the sbitmap library.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
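    The spreading described above can be sketched in plain user-space C.
    This is a toy model, not the kernel's sbitmap API; init_alloc_hints
    and the fixed constants are illustrative, and the real code seeds
    each CPU's hint randomly rather than deterministically:

    ```c
    #include <assert.h>

    #define NR_CPUS       4
    #define DEPTH         256   /* total bits in the bitmap */
    #define BITS_PER_WORD 64    /* one cacheline-sized word, simplified */

    static unsigned int alloc_hint[NR_CPUS];

    /* Spread each CPU's starting hint across the bitmap so searches begin
     * in different words instead of every CPU hammering word 0. */
    static void init_alloc_hints(void)
    {
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            alloc_hint[cpu] = (DEPTH / NR_CPUS) * cpu;
    }

    int main(void)
    {
        int a, b;

        init_alloc_hints();
        /* No two CPUs start their search in the same word. */
        for (a = 0; a < NR_CPUS; a++)
            for (b = a + 1; b < NR_CPUS; b++)
                assert(alloc_hint[a] / BITS_PER_WORD !=
                       alloc_hint[b] / BITS_PER_WORD);
        return 0;
    }
    ```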
     
  • Again, there's no point in passing this in every time. Make it part of
    struct sbitmap_queue and clean up the API.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Allocating your own per-cpu allocation hint separately makes for an
    awkward API. Instead, allocate the per-cpu hint as part of the struct
    sbitmap_queue. There's no point in a struct sbitmap_queue without the
    cache, but you can still use a bare struct sbitmap.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • We currently account a '0' dispatch, and anything above that up to
    the limit set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more than
    that, we don't account it.

    Change the last bucket to be inclusive of anything above the range we
    track, and have the sysfs file reflect that by including a '+' in the
    output:

    $ cat /sys/block/nvme0n1/mq/0/dispatched
    0 1006
    1 20229
    2 1
    4 0
    8 0
    16 0
    32+ 0

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
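    The bucketing with an inclusive last slot can be modeled in
    user-space C; dispatch_bucket and MAX_DISPATCH_BUCKETS are
    illustrative names, not the kernel's:

    ```c
    #include <assert.h>

    #define MAX_DISPATCH_BUCKETS 7   /* labels: 0, 1, 2, 4, 8, 16, 32+ */

    /* Map a per-run dispatch count to a power-of-two histogram bucket.
     * The final bucket absorbs everything at or above its lower bound,
     * which is what the '+' in the sysfs output expresses. */
    static int dispatch_bucket(unsigned int dispatched)
    {
        int bucket = 0;

        while (dispatched >= (1u << bucket) &&
               bucket < MAX_DISPATCH_BUCKETS - 1)
            bucket++;
        return bucket;
    }

    int main(void)
    {
        assert(dispatch_bucket(0) == 0);
        assert(dispatch_bucket(1) == 1);
        assert(dispatch_bucket(3) == 2);     /* 2-3 share the '2' bucket */
        assert(dispatch_bucket(31) == 5);    /* top of the '16' bucket */
        assert(dispatch_bucket(1000) == 6);  /* lands in '32+' */
        return 0;
    }
    ```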
     

15 Sep, 2016

1 commit

  • blk_mq_delay_kick_requeue_list() provides the ability to kick the
    q->requeue_list after a specified time. To do this the request_queue's
    'requeue_work' member was changed to a delayed_work.

    blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
    requests while it doesn't make sense to immediately requeue them
    (e.g. when all paths in a DM multipath have failed).

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

14 Sep, 2016

3 commits

  • commit e1defc4ff0cf57aca6c5e3ff99fa503f5943c1f1
    "block: Do away with the notion of hardsect_size"
    removed the notion of "hardware sector size" from
    the kernel in favor of logical block size, but
    references remain in comments and documentation.

    Update the remaining sites mentioning hardsect.

    Signed-off-by: Linus Walleij
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Linus Walleij
     
  • Allow the io_poll statistics to be zeroed to make for easier logging
    of polling events.

    Signed-off-by: Stephen Bates
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stephen Bates
     
  • In order to help determine the effectiveness of polling in a running
    system, it is useful to determine the ratio of how often the poll
    function is called vs how often the completion is checked. For this
    reason we add a poll_considered variable and add it to the sysfs entry
    for io_poll.

    Signed-off-by: Stephen Bates
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stephen Bates
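    A minimal user-space model of the two counters (the struct and
    function names here are illustrative, not the actual blk-mq fields):

    ```c
    #include <assert.h>

    /* Toy model of the io_poll statistics: every entry into the poll
     * path is "considered"; only calls that actually invoke the driver's
     * poll function count as "invoked".  Their ratio shows how effective
     * polling is on a running system. */
    struct poll_stat {
        unsigned long considered;
        unsigned long invoked;
    };

    static int blk_poll_model(struct poll_stat *st, int can_poll)
    {
        st->considered++;
        if (!can_poll)
            return 0;
        st->invoked++;
        return 1;
    }

    int main(void)
    {
        struct poll_stat st = { 0, 0 };

        blk_poll_model(&st, 1);
        blk_poll_model(&st, 0);
        blk_poll_model(&st, 1);
        assert(st.considered == 3);
        assert(st.invoked == 2);
        return 0;
    }
    ```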
     

25 Aug, 2016

2 commits

  • __blk_mq_run_hw_queue() currently warns if we are running the queue on a
    CPU that isn't set in its mask. However, this can happen if a CPU is
    being offlined, and the workqueue handling will place the work on CPU0
    instead. Improve the warning so that it only triggers if the batch cpu
    in the hardware queue is currently online. If it triggers for that
    case, then it's indicative of a flow problem in blk-mq, so we want to
    retain it for that case.

    Signed-off-by: Jens Axboe

    Jens Axboe
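    The tightened condition can be sketched with plain bitmasks in
    user-space C; should_warn and the mask arguments are illustrative
    stand-ins for cpumask_test_cpu() and cpu_online():

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Warn about running on a CPU outside the hardware queue's mask only
     * when the queue's batch CPU is online; if it is offline, the work
     * was legitimately punted to another CPU during hot-unplug. */
    static bool should_warn(unsigned int this_cpu, unsigned long hctx_mask,
                            unsigned int batch_cpu, unsigned long online_mask)
    {
        bool cpu_in_mask  = hctx_mask   & (1UL << this_cpu);
        bool batch_online = online_mask & (1UL << batch_cpu);

        return !cpu_in_mask && batch_online;
    }

    int main(void)
    {
        /* CPU 0 runs a queue mapped to CPU 1 while CPU 1 is offline: OK. */
        assert(!should_warn(0, 0x2UL, 1, 0x1UL));
        /* Same situation but CPU 1 is online: a real flow problem. */
        assert(should_warn(0, 0x2UL, 1, 0x3UL));
        /* Running on a CPU that is in the mask never warns. */
        assert(!should_warn(1, 0x2UL, 1, 0x3UL));
        return 0;
    }
    ```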
     
  • We do this in a few places, if the CPU is offline. This isn't allowed,
    though, since on multi queue hardware, we can't just move a request
    from one software queue to another, if they map to different hardware
    queues. The request and tag aren't valid on another hardware queue.

    This can happen if plugging races with CPU offlining. But it does
    no harm, since it can only happen in the window where we are
    currently busy freezing the queue and flushing IO, in preparation
    for redoing the software to hardware queue mappings.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Aug, 2016

1 commit

  • After arbitrary bio size was introduced, the incoming bio may
    be very big. We have to split the bio into small bios so that
    each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
    as in bio_clone().

    This patch fixes the following kernel crash:

    > [ 172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    > [ 172.660229] IP: [] bio_trim+0xf/0x2a
    > [ 172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
    > [ 172.660399] Oops: 0000 [#1] SMP
    > [...]
    > [ 172.664780] Call Trace:
    > [ 172.664813] [] ? raid1_make_request+0x2e8/0xad7 [raid1]
    > [ 172.664846] [] ? blk_queue_split+0x377/0x3d4
    > [ 172.664880] [] ? md_make_request+0xf6/0x1e9 [md_mod]
    > [ 172.664912] [] ? generic_make_request+0xb5/0x155
    > [ 172.664947] [] ? prio_io+0x85/0x95 [bcache]
    > [ 172.664981] [] ? register_cache_set+0x355/0x8d0 [bcache]
    > [ 172.665016] [] ? register_bcache+0x1006/0x1174 [bcache]

    The issue can be reproduced by the following steps:
    - create one raid1 over two virtio-blk
    - build bcache device over the above raid1 and another cache device
    and bucket size is set as 2Mbytes
    - set cache mode as writeback
    - run random write over ext4 on the bcache device

    Fixes: 54efd50 ("block: make generic_make_request handle arbitrarily sized bios")
    Reported-by: Sebastian Roesner
    Reported-by: Eric Wheeler
    Cc: stable@vger.kernel.org (4.3+)
    Cc: Shaohua Li
    Acked-by: Kent Overstreet
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
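    The size cap can be illustrated with a small user-space calculation.
    This is a sketch assuming 4K pages and 512-byte sectors; the real
    splitting works on bvec boundaries, not a flat sector count:

    ```c
    #include <assert.h>

    #define BIO_MAX_PAGES 256
    #define PAGE_SECTORS  (4096 / 512)   /* sectors per 4K page */

    /* Number of smaller bios a large bio must be split into so that each
     * piece carries at most BIO_MAX_PAGES pages of data (round up). */
    static unsigned int nr_split_bios(unsigned int sectors)
    {
        unsigned int max_sectors = BIO_MAX_PAGES * PAGE_SECTORS;  /* 2048 */

        return (sectors + max_sectors - 1) / max_sectors;
    }

    int main(void)
    {
        assert(nr_split_bios(2048) == 1);   /* exactly 1 MiB fits in one bio */
        assert(nr_split_bios(2049) == 2);   /* one sector over forces a split */
        assert(nr_split_bios(8192) == 4);   /* a 4 MiB bio becomes four */
        return 0;
    }
    ```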
     

17 Aug, 2016

1 commit

  • blk_set_queue_dying() can be called while another thread is
    submitting I/O or changing queue flags, e.g. through dm_stop_queue().
    Hence protect the QUEUE_FLAG_DYING flag change with locking.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Cc: stable
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

16 Aug, 2016

1 commit

  • Commit 288dab8a35a0 ("block: add a separate operation type for secure
    erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
    all the places REQ_OP_DISCARD was being used to mean either. Fix those.

    Signed-off-by: Adrian Hunter
    Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
    Signed-off-by: Jens Axboe

    Adrian Hunter
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

6 commits

  • If we fail registering any of the hardware queues, we call
    into blk_mq_unregister_disk() with the hotplug mutex already
    held. Since blk_mq_unregister_disk() attempts to acquire the
    same mutex, we end up in a less than happy place.

    Reported-by: Jinpu Wang
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The name for a bdi of a gendisk is derived from the gendisk's devt.
    However, since the gendisk is destroyed before the bdi it leaves a
    window where a new gendisk could dynamically reuse the same devt while a
    bdi with the same name is still live. Arrange for the bdi to hold a
    reference against its "owner" disk device while it is registered.
    Otherwise we can hit sysfs duplicate name collisions like the following:

    WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

    Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
    0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
    ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
    0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
    Call Trace:
    [] dump_stack+0x63/0x87
    [] __warn+0xd1/0xf0
    [] warn_slowpath_fmt+0x5f/0x80
    [] sysfs_warn_dup+0x64/0x80
    [] sysfs_create_dir_ns+0x7e/0x90
    [] kobject_add_internal+0xaa/0x320
    [] ? vsnprintf+0x34e/0x4d0
    [] kobject_add+0x75/0xd0
    [] ? mutex_lock+0x12/0x2f
    [] device_add+0x125/0x610
    [] device_create_groups_vargs+0xd8/0x100
    [] device_create_vargs+0x1c/0x20
    [] bdi_register+0x8c/0x180
    [] bdi_register_dev+0x27/0x30
    [] add_disk+0x175/0x4a0

    Cc:
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Dan Williams

    Fixed up missing 0 return in bdi_register_owner().

    Signed-off-by: Jens Axboe

    Dan Williams
     
  • In case a submitted request gets stuck for some reason, the block layer
    can prevent request starvation by starting the scheduled timeout work.
    If this stuck request occurs at the same time another thread has started
    a queue freeze, the blk_mq_timeout_work will not be able to acquire the
    queue reference and will return silently, thus not issuing the timeout.
    But since the request is already holding a q_usage_counter reference and
    is unable to complete, it will never release its reference, preventing
    the queue from completing the freeze started by first thread. This puts
    the request_queue in a hung state, forever waiting for the freeze
    completion.

    This was observed while running IO to a NVMe device at the same time we
    toggled the CPU hotplug code. Eventually, once a request got stuck
    requiring a timeout during a queue freeze, we saw the CPU Hotplug
    notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
    the trace below.

    [c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
    [c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
    [c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
    [c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
    [c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
    [c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
    [c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
    [c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
    [c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
    [c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
    [c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
    [c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
    [c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
    [c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
    [c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
    [c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
    [c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
    [c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
    [c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
    [c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

    The fix is to allow the timeout work to execute in the window between
    dropping the initial refcount reference and the release of the last
    reference, which actually marks the freeze completion. This can be
    achieved with percpu_refcount_tryget, which does not require the counter
    to be alive. This way the timeout work can do its job and terminate a
    stuck request even during a freeze, returning its reference and avoiding
    the deadlock.

    Allowing the timeout to run is just a part of the fix, since for some
    devices, we might get stuck again inside the device driver's timeout
    handler, should it attempt to allocate a new request in that path -
    which is a quite common action for Abort commands, which need to be sent
    after a timeout. In NVMe, for instance, we call blk_mq_alloc_request
    from inside the timeout handler, which will fail during a freeze, since
    it also tries to acquire a queue reference.

    I considered a similar change to blk_mq_alloc_request as a generic
    solution for further device driver hangs, but we can't do that, since it
    would allow new requests to disturb the freeze process. I thought about
    creating a new function in the block layer to support unfreezable
    requests for these occasions, but after working on it for a while, I
    feel like this should be handled on a per-driver basis. I'm now
    experimenting with changes to the NVMe timeout path, but I'm open to
    suggestions of ways to make this generic.

    Signed-off-by: Gabriel Krisman Bertazi
    Cc: Brian King
    Cc: Keith Busch
    Cc: linux-nvme@lists.infradead.org
    Cc: linux-block@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
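    The difference between the two tryget variants can be modeled with a
    three-state toy refcount. This is a simplified model of percpu_ref
    semantics, not the real implementation:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Queue freeze states: a freeze kills the ref (DYING), but it only
     * truly dies once the last outstanding reference drops (DEAD). */
    enum ref_state { REF_ALIVE, REF_DYING, REF_DEAD };

    /* tryget_live: fails as soon as a freeze starts -- this is what made
     * the timeout work return silently and deadlock the freeze. */
    static bool tryget_live(enum ref_state s) { return s == REF_ALIVE; }

    /* tryget: still succeeds while the ref is dying, so the timeout work
     * can run, terminate the stuck request, and drop its reference. */
    static bool tryget(enum ref_state s) { return s != REF_DEAD; }

    int main(void)
    {
        assert(tryget_live(REF_ALIVE) && tryget(REF_ALIVE));
        assert(!tryget_live(REF_DYING) && tryget(REF_DYING));
        assert(!tryget_live(REF_DEAD) && !tryget(REF_DEAD));
        return 0;
    }
    ```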
     
  • I got a KASAN report of use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
    Read of size 8 by task trinity-c1/315
    =============================================================================
    BUG kmalloc-32 (Not tainted): kasan: bad access detected
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
    ___slab_alloc+0x4f1/0x520
    __slab_alloc.isra.58+0x56/0x80
    kmem_cache_alloc_trace+0x260/0x2a0
    disk_seqf_start+0x66/0x110
    traverse+0x176/0x860
    seq_read+0x7e3/0x11a0
    proc_reg_read+0xbc/0x180
    do_loop_readv_writev+0x134/0x210
    do_readv_writev+0x565/0x660
    vfs_readv+0x67/0xa0
    do_preadv+0x126/0x170
    SyS_preadv+0xc/0x10
    do_syscall_64+0x1a1/0x460
    return_from_SYSCALL_64+0x0/0x6a
    INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
    __slab_free+0x17a/0x2c0
    kfree+0x20a/0x220
    disk_seqf_stop+0x42/0x50
    traverse+0x3b5/0x860
    seq_read+0x7e3/0x11a0
    proc_reg_read+0xbc/0x180
    do_loop_readv_writev+0x134/0x210
    do_readv_writev+0x565/0x660
    vfs_readv+0x67/0xa0
    do_preadv+0x126/0x170
    SyS_preadv+0xc/0x10
    do_syscall_64+0x1a1/0x460
    return_from_SYSCALL_64+0x0/0x6a

    CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G B 4.7.0+ #62
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
    ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
    ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
    Call Trace:
    [] dump_stack+0x65/0x84
    [] print_trailer+0x10d/0x1a0
    [] object_err+0x2f/0x40
    [] kasan_report_error+0x221/0x520
    [] __asan_report_load8_noabort+0x3e/0x40
    [] klist_iter_exit+0x61/0x70
    [] class_dev_iter_exit+0x9/0x10
    [] disk_seqf_stop+0x3a/0x50
    [] seq_read+0x4b2/0x11a0
    [] proc_reg_read+0xbc/0x180
    [] do_loop_readv_writev+0x134/0x210
    [] do_readv_writev+0x565/0x660
    [] vfs_readv+0x67/0xa0
    [] do_preadv+0x126/0x170
    [] SyS_preadv+0xc/0x10

    This problem can occur in the following situation:

    open()
    - pread()
      - .seq_start()
        - iter = kmalloc() // succeeds
        - seqf->private = iter
      - .seq_stop()
        - kfree(seqf->private)
    - pread()
      - .seq_start()
        - iter = kmalloc() // fails
      - .seq_stop()
        - class_dev_iter_exit(seqf->private) // boom! old pointer

    As the comment in disk_seqf_stop() says, stop is called even if start
    failed, so we need to reinitialise the private pointer to NULL when seq
    iteration stops.

    An alternative would be to set the private pointer to NULL when the
    kmalloc() in disk_seqf_start() fails.

    Cc: stable@vger.kernel.org
    Signed-off-by: Vegard Nossum
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Vegard Nossum
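    The fix reduces to one line: clear the pointer when stopping. A
    user-space model (struct fields simplified; not the real seq_file
    internals):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    struct seq_model { void *private; };

    /* Model of disk_seqf_stop(): stop runs even when start failed, so it
     * must clear ->private after freeing it.  Without the NULL
     * assignment, a stop following a failed start touches a stale
     * pointer, exactly the use-after-free KASAN reported. */
    static void seqf_stop(struct seq_model *seqf)
    {
        if (seqf->private) {
            free(seqf->private);
            seqf->private = NULL;   /* the essential one-line fix */
        }
    }

    int main(void)
    {
        struct seq_model s = { malloc(32) };

        seqf_stop(&s);             /* first traversal ends, iter freed */
        assert(s.private == NULL);
        seqf_stop(&s);             /* stop after a failed start: a no-op */
        assert(s.private == NULL);
        return 0;
    }
    ```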
     
  • When a bio is cloned, the newly created bio must be associated with
    the same blkcg as the original bio (if BLK_CGROUP is enabled). If
    this operation is not performed, then the new bio is not associated
    with any group, and the group of the current task is returned when
    the group of the bio is requested.

    Depending on the cloning frequency, this may cause a large
    percentage of the bios belonging to a given group to be treated
    as if belonging to other groups (in most cases as if belonging to
    the root group). The expected group isolation may thereby be broken.

    This commit adds the missing association in bio-cloning functions.

    Fixes: da2f0f74cf7d ("Btrfs: add support for blkio controllers")
    Cc: stable@vger.kernel.org # v4.3+

    Signed-off-by: Paolo Valente
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Jeff Moyer
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • 'nr_undestroyed_grps' in struct throtl_data was used to count
    the number of throtl_grp structures associated with a throtl_data,
    but now throtl_grp is tracked by blkcg_gq, so it is no longer needed.

    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
     

04 Aug, 2016

1 commit

  • The functionality for block device DAX was already removed with commit
    acc93d30d7d4 ("Revert "block: enable dax for raw block devices"")

    However, we still had a config option hanging around that was always
    disabled because it depended on CONFIG_BROKEN. This config option was
    introduced in commit 03cdadb04077 ("block: disable block device DAX by
    default")

    This change reverts that commit, removing the dead config option.

    Link: http://lkml.kernel.org/r/20160729182314.6368-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Cc: Dave Hansen
    Acked-by: Dan Williams
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

27 Jul, 2016

2 commits

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a followup to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

26 Jul, 2016

1 commit

  • Pull timer updates from Thomas Gleixner:
    "This update provides the following changes:

    - The rework of the timer wheel which addresses the shortcomings of
    the current wheel (cascading, slow search for next expiring timer,
    etc). That's the first major change of the wheel in almost 20
    years since Finn implemented it.

    - A large overhaul of the clocksource drivers init functions to
    consolidate the Device Tree initialization

    - Some more Y2038 updates

    - A capability fix for timerfd

    - Yet another clock chip driver

    - The usual pile of updates, comment improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
    tick/nohz: Optimize nohz idle enter
    clockevents: Make clockevents_subsys static
    clocksource/drivers/time-armada-370-xp: Fix return value check
    timers: Implement optimization for same expiry time in mod_timer()
    timers: Split out index calculation
    timers: Only wake softirq if necessary
    timers: Forward the wheel clock whenever possible
    timers/nohz: Remove pointless tick_nohz_kick_tick() function
    timers: Optimize collect_expired_timers() for NOHZ
    timers: Move __run_timers() function
    timers: Remove set_timer_slack() leftovers
    timers: Switch to a non-cascading wheel
    timers: Reduce the CPU index space to 256k
    timers: Give a few structs and members proper names
    hlist: Add hlist_is_singular_node() helper
    signals: Use hrtimer for sigtimedwait()
    timers: Remove the deprecated mod_timer_pinned() API
    timers, net/ipv4/inet: Initialize connection request timers as pinned
    timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
    timers, drivers/tty/metag_da: Initialize the poll timer as pinned
    ...

    Linus Torvalds
     

21 Jul, 2016

11 commits

  • For a front merge, the maximum number of sectors of the
    request must be checked against the front merge BIO sector,
    not the current sector of the request.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Before merging a bio into an existing request, the io scheduler is called to
    get its approval first. However, the requests that come from a plug
    flush may get merged by block layer without consulting with io
    scheduler.

    In case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, high weight cgroup
    now will depend on low weight cgroup to get scheduled. If high weight cgroup
    needs that io request to complete before submitting more requests, then it
    will also lose its timeslice.

    Following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
      mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
      if [ -d $i ]; then
        rmdir $i
      fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
     fio --readonly --name name1 --filename /dev/sdb \
       --rw read --size 64k --bs 64k --time_based \
       --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
     fio --readonly --name name1 --filename /dev/sdb \
       --rw read --size 64k --bs 64k --time_based \
       --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
     fio --readonly --name name1 --filename /dev/sdb \
       --rw read --size 64k --bs 64k --time_based \
       --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
      echo ---- $i ----
      cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
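    The rule the patch enforces -- every merge path asks the scheduler
    first -- can be sketched as follows. The cgroup-comparison policy is
    only a stand-in for whatever the elevator's allow-merge hook actually
    checks, and the struct names are illustrative:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    struct request_model { int cgroup_id; };

    /* Stand-in for the elevator's allow-merge hook: a CFQ-like policy
     * that refuses merges across cgroups, so one group's IO is never
     * charged to (and stuck waiting behind) another group's request. */
    static bool elv_allow_merge(const struct request_model *rq,
                                const struct request_model *next)
    {
        return rq->cgroup_id == next->cgroup_id;
    }

    /* Every merge path, including the plug flush, must take this gate. */
    static bool attempt_merge(const struct request_model *rq,
                              const struct request_model *next)
    {
        if (!elv_allow_merge(rq, next))
            return false;           /* scheduler vetoed the merge */
        /* ... perform the actual merge here ... */
        return true;
    }

    int main(void)
    {
        struct request_model g1 = { 1 }, g1b = { 1 }, g2 = { 2 };

        assert(attempt_merge(&g1, &g1b));   /* same group: allowed */
        assert(!attempt_merge(&g1, &g2));   /* cross-group: vetoed */
        return 0;
    }
    ```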
     
  • Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Provides the ability to identify DAX-enabled devices in userspace.

    Signed-off-by: Yigal Korman
    Signed-off-by: Toshi Kani
    Acked-by: Dan Williams
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Yigal Korman
     
  • They are unused and potential new users really should use the
    blk_rq_map* versions.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • I wish the OSD code could simply use blk_rq_map_* helpers like
    everyone else, but the complex nature of deciding if we have
    DATA IN and/or DATA OUT buffers might make this impossible
    (at least for a mere human like me).

    But using blk_rq_append_bio at least allows sharing the setup code
    between requests with or without data buffers, and given that this
    is the last user of blk_make_request it allows getting rid of that
    somewhat awkward interface.

    Signed-off-by: Christoph Hellwig
    Acked-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The target SCSI passthrough backend is much better served with the low-level
    blk_rq_append_bio construct than the helpers built on top of it, so export it.

    Also use the opportunity to remove the pointless request_queue argument and
    make the code flow a little more readable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_get_request is used for BLOCK_PC and similar passthrough requests.
    Currently we always need to call blk_rq_set_block_pc or an open coded
    version of it to allow appending bios using the request mapping helpers
    later on, which is a somewhat awkward API. Instead move the
    initialization part of blk_rq_set_block_pc into blk_get_request, so that
    we always have a safe-to-use request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Instead of a flag and an index, just make sure an index of 0 means
    no need to free the bvec array. Also move the constants related
    to the bvec pools together and use a consistent naming scheme for
    them.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Christoph Hellwig
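    The simplification can be shown with a tiny model (the field and
    function names are illustrative, not the real struct bio layout):

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Instead of a separate "owns a bvec array" flag plus a pool index,
     * a pool index of 0 itself means there is nothing to free. */
    struct bio_model {
        unsigned short bvec_pool;   /* 0 = inline/none, >0 = pool slot */
    };

    static bool bio_needs_bvec_free(const struct bio_model *bio)
    {
        return bio->bvec_pool != 0;
    }

    int main(void)
    {
        struct bio_model inline_bio = { 0 }, pooled_bio = { 3 };

        assert(!bio_needs_bvec_free(&inline_bio));
        assert(bio_needs_bvec_free(&pooled_bio));
        return 0;
    }
    ```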
     
  • WRITE SAME is a data integrity operation and we can't simply ignore
    errors.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently blkdev_issue_zeroout cascades down from discards (if the driver
    guarantees that discards zero data), to WRITE SAME and then to a loop
    writing zeroes. Unfortunately we ignore run-time EOPNOTSUPP errors in the
    block layer blkdev_issue_discard helper to work around DM volumes that
    may have mixed discard support underneath.

    This patch introduces a new BLKDEV_DISCARD_ZERO flag to
    blkdev_issue_discard that indicates we are called for a zeroing
    operation. This allows us both to ignore the EOPNOTSUPP hack and to
    consolidate the discard_zeroes_data check into the function.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
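    The cascade, with the new flag's role, can be sketched like this.
    The caps structure and return codes are illustrative; the real
    function operates on a request_queue's limits:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    struct bdev_caps {
        bool discard_zeroes_data;   /* discard is guaranteed to zero */
        bool write_same;            /* device supports WRITE SAME */
    };

    /* Which method zeroout ends up using: 0 = discard issued with the
     * BLKDEV_DISCARD_ZERO meaning (must really zero, no EOPNOTSUPP
     * hack), 1 = WRITE SAME of a zero page, 2 = a plain loop writing
     * zero pages. */
    static int zeroout_method(const struct bdev_caps *caps)
    {
        if (caps->discard_zeroes_data)
            return 0;
        if (caps->write_same)
            return 1;
        return 2;
    }

    int main(void)
    {
        struct bdev_caps d = { true, true }, w = { false, true },
                         none = { false, false };

        assert(zeroout_method(&d) == 0);
        assert(zeroout_method(&w) == 1);
        assert(zeroout_method(&none) == 2);
        return 0;
    }
    ```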