30 Sep, 2015

1 commit

  • commit 596f5aad2a704b72934e5abec1b1b4114c16f45b upstream.

    There may be so many pending requests that the PAGE_SIZE buffer
    cannot hold them all.

    A typical example is scsi-mq: the queue depth (.can_queue) of the
    scsi_host and blk-mq is quite large, but the scsi_device's queue
    depth (.cmd_per_lun) is small, so it is easy to accumulate lots of
    pending requests in the hw queue.

    This patch fixes the following warning and the related memory
    corruption.

    [ 359.025101] fill_read_buffer: blk_mq_hw_sysfs_show+0x0/0x7d returned bad count
    [ 359.055595] irq event stamp: 15537
    [ 359.055606] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 359.055614] Dumping ftrace buffer:
    [ 359.055660] (ftrace buffer empty)
    [ 359.055672] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
    [ 359.055678] CPU: 4 PID: 21631 Comm: stress-ng-sysfs Not tainted 4.2.0-rc5-next-20150805 #434
    [ 359.055679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    [ 359.055682] task: ffff8802161cc000 ti: ffff88021b4a8000 task.ti: ffff88021b4a8000
    [ 359.055693] RIP: 0010:[] [] __kmalloc+0xe8/0x152
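
    A minimal sketch of the guard this implies, assuming a helper along
    these lines (sysfs_list_show() and its parameters are illustrative,
    not necessarily the verbatim patch):

    static ssize_t sysfs_list_show(char *page, struct list_head *list,
                                   char *msg)
    {
        char *start = page;
        char *end = page + PAGE_SIZE;
        struct request *rq;

        list_for_each_entry(rq, list, queuelist) {
            /* stop before the PAGE_SIZE buffer can overflow and
             * mark the truncation instead */
            if (end - start < 100) {
                start += sprintf(start, "...\n");
                break;
            }
            start += snprintf(start, end - start, "%s: rq %p\n",
                              msg, rq);
        }
        return start - page;
    }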

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

14 Sep, 2015

1 commit

  • commit 4f258a46346c03fa0bbb6199ffaf4e1f9f599660 upstream.

    Commit bcdb247c6b6a ("sd: Limit transfer length") clamped the maximum
    size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK
    LIMITS VPD. This had the unfortunate effect of also limiting the maximum
    size of non-filesystem requests sent to the device through sg/bsg.

    Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue
    limit directly.

    Also update the comment in blk_limits_max_hw_sectors() to clarify that
    max_hw_sectors defines the limit for the I/O controller only.
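
    In sketch form (max_xfer is an illustrative stand-in for the sd
    driver's VPD-derived value, not the verbatim patch):

    /* clamp only the fs-request limit; max_hw_sectors stays at the
     * controller limit so sg/bsg passthrough is unaffected */
    q->limits.max_sectors = min_t(unsigned int, max_xfer,
                                  q->limits.max_hw_sectors);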

    Signed-off-by: Martin K. Petersen
    Reported-by: Brian King
    Tested-by: Brian King
    Signed-off-by: James Bottomley
    Signed-off-by: Greg Kroah-Hartman

    Martin K. Petersen
     

11 Aug, 2015

3 commits

  • commit e56f698bd0720e17f10f39e8b0b5b446ad0ab22c upstream.

    It is reasonable to set the default request timeout to 30 seconds
    rather than 30000 ticks, which would be 300 seconds if HZ is 100;
    some arm64-based systems, for example, run at 100 HZ.
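
    The change, in sketch form, expresses the default in HZ-scaled units
    rather than raw ticks:

    /* 30 seconds on any CONFIG_HZ, instead of 30000 ticks */
    blk_queue_rq_timeout(q, 30 * HZ);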

    Signed-off-by: Ming Lei
    Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 5f6c2d2b7dbb541c1e922538c49fa04c494ae3d7 upstream.

    When a blkcg configuration is targeted to a partition rather than a
    whole device, blkg_conf_prep fails with -EINVAL; unfortunately, it
    forgets to put the gendisk ref in that case. Fix it.
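
    A minimal sketch of the fixed error path, consistent with the
    description (not necessarily the verbatim patch):

    struct module *owner;

    if (part) {
        /* config targets a partition: drop the gendisk ref taken above */
        owner = disk->fops->owner;
        put_disk(disk);
        module_put(owner);
        return -EINVAL;
    }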

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit bb8bd38b9a1685334b73e8c62e128cbedb875867 upstream.

    bio_integrity_alloc() and bio_integrity_free() assume that if a bio was
    allocated from a bioset, that bioset also had its bio_integrity_pool
    allocated using bioset_integrity_create(). This is a very bad
    assumption given that bioset_create() and bioset_integrity_create() are
    completely disjoint. Not all callers of bioset_create() have been
    trained to also call bioset_integrity_create() -- and they may not care
    to be.

    Fix this by falling back to kmalloc'ing 'struct bio_integrity_payload'
    rather than force all bioset consumers to (wastefully) preallocate a
    bio_integrity_pool that they very likely won't actually need (given the
    niche nature of the current block integrity support).
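
    A sketch of the fallback, assuming the nr_vecs/gfp_mask parameters of
    bio_integrity_alloc() (illustrative only):

    struct bio_integrity_payload *bip;

    if (!bs || !bs->bio_integrity_pool) {
        /* bioset has no integrity pool: fall back to plain kmalloc */
        bip = kmalloc(sizeof(*bip) +
                      sizeof(struct bio_vec) * nr_vecs, gfp_mask);
    } else {
        bip = mempool_alloc(bs->bio_integrity_pool, gfp_mask);
    }
    if (!bip)
        return NULL;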

    Otherwise, a NULL pointer "Kernel BUG" with a trace like the following
    will be observed (as seen on s390x using zfcp storage) because dm-io
    doesn't use bioset_integrity_create() when creating its bioset:

    [ 791.643338] Call Trace:
    [ 791.643339] ([] 0x3df98b848)
    [ 791.643341] [] bio_integrity_alloc+0x48/0xf8
    [ 791.643348] [] bio_integrity_prep+0xae/0x2f0
    [ 791.643349] [] blk_queue_bio+0x1c8/0x3d8
    [ 791.643355] [] generic_make_request+0xc0/0x100
    [ 791.643357] [] submit_bio+0xa2/0x198
    [ 791.643406] [] dispatch_io+0x15c/0x3b0 [dm_mod]
    [ 791.643419] [] dm_io+0x176/0x2f0 [dm_mod]
    [ 791.643423] [] do_reads+0x13a/0x1a8 [dm_mirror]
    [ 791.643425] [] do_mirror+0x142/0x298 [dm_mirror]
    [ 791.643428] [] process_one_work+0x18a/0x3f8
    [ 791.643432] [] worker_thread+0x132/0x3b0
    [ 791.643435] [] kthread+0xd2/0xd8
    [ 791.643438] [] kernel_thread_starter+0x6/0xc
    [ 791.643446] [] kernel_thread_starter+0x0/0xc

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

04 Aug, 2015

1 commit

  • commit f3f5da624e0a891c34d8cd513c57f1d9b0c7dadc upstream.

    This fixes a data corruption bug when using discard on top of MD linear,
    raid0 and raid10 personalities.

    Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing
    the bio_vec between the two resulting bios. That is fine for read/write
    requests where the bio_vec is immutable. For discards, however, we need
    to be able to attach a payload and update the bio_vec so the page can
    get mapped to a scatterlist entry. Therefore the bio_vec can not be
    shared when splitting discards and we must do a full clone.

    Signed-off-by: Martin K. Petersen
    Reported-by: Seunguk Shin
    Tested-by: Seunguk Shin
    Cc: Seunguk Shin
    Cc: Jens Axboe
    Cc: Kent Overstreet
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Martin K. Petersen
     

11 Jun, 2015

1 commit

  • =================================
    [ INFO: inconsistent lock state ]
    4.1.0-rc7+ #217 Tainted: G O
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (ext_devt_lock){+.?...}, at: [] blk_free_devt+0x3c/0x70
    {SOFTIRQ-ON-W} state was registered at:
    [] __lock_acquire+0x461/0x1e70
    [] lock_acquire+0xb7/0x290
    [] _raw_spin_lock+0x38/0x50
    [] blk_alloc_devt+0x6d/0xd0
    [] __lock_acquire+0x3fe/0x1e70
    [] ? __lock_acquire+0xe5d/0x1e70
    [] lock_acquire+0xb7/0x290
    [] ? blk_free_devt+0x3c/0x70
    [] _raw_spin_lock+0x38/0x50
    [] ? blk_free_devt+0x3c/0x70
    [] blk_free_devt+0x3c/0x70
    [] part_release+0x1c/0x50
    [] device_release+0x36/0xb0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] put_device+0x17/0x20
    [] delete_partition_rcu_cb+0x16c/0x180
    [] ? read_dev_sector+0xa0/0xa0
    [] rcu_process_callbacks+0x2ff/0xa90
    [] ? rcu_process_callbacks+0x2bf/0xa90
    [] __do_softirq+0xde/0x600

    Neil sees this in his tests and it also triggers on pmem driver unbind
    for the libnvdimm tests. This fix is on top of an initial fix by Keith
    for incorrect usage of mutex_lock() in this path: 2da78092dda1 "block:
    Fix dev_t minor allocation lifetime". Both this and 2da78092dda1 are
    candidates for -stable.

    Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
    Cc:
    Cc: Keith Busch
    Reported-by: NeilBrown
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

10 Jun, 2015

1 commit

  • Now blk_cleanup_queue() can be called before del_gendisk()[1].
    Inside del_gendisk(), hctx->ctxs is touched from
    blk_mq_unregister_hctx(), but the variable has already been freed by
    blk_cleanup_queue() at that point.

    So this patch moves the freeing of hctx->ctxs into the queue's
    release handler, fixing the oops reported by Stefan.

    [1] 6cd18e711dd8075 ("block: destroy bdi before blockdev is
    unregistered")
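
    In sketch form, with blk_mq_release() standing in for the queue's
    release handler (illustrative of the described move):

    void blk_mq_release(struct request_queue *q)
    {
        struct blk_mq_hw_ctx *hctx;
        unsigned int i;

        /* hctx->ctxs must outlive blk_mq_unregister_hctx(), so free it
         * here, on final queue release, not in blk_cleanup_queue() */
        queue_for_each_hw_ctx(q, hctx, i) {
            kfree(hctx->ctxs);
            kfree(hctx);
        }
    }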

    Reported-by: Stefan Seyfried
    Cc: NeilBrown
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org (v4.0)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 May, 2015

1 commit

  • bdi_unregister() now contains very little functionality.

    It contains a WARN_ON if bdi->dev is NULL. This warning is of no real
    consequence, as bdi->dev isn't needed by anything else in the
    function, and it triggers if
    blk_cleanup_queue() -> bdi_destroy()
    is called before bdi_unregister(), which happens since
    commit 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")

    So this warning isn't wanted.

    It also calls bdi_set_min_ratio(). This needs to be called after
    writes through the bdi have all been flushed, and before the bdi is destroyed.
    Calling it early is better than calling it late as it frees up a global
    resource.

    Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
    perfectly fits these requirements.

    So bdi_unregister() can be discarded, with the important content moved
    to bdi_destroy(), as can the writeback_bdi_unregister event, which is
    already unused.

    Reported-by: Mike Snitzer
    Cc: stable@vger.kernel.org (v4.0)
    Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info")
    Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Dan Williams
    Tested-by: Nicholas Moulin
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

13 May, 2015

1 commit

  • With commit ff36ab345 ("dm: remove request-based logic from
    make_request_fn wrapper") DM no longer calls blk_queue_bio() directly,
    so remove its export. Doing so required a forward declaration in
    blk-core.c.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

05 May, 2015

1 commit

  • Normally, if the driver is too busy to dispatch a request, the logic
    is as below:

    block layer:                      driver:
        __blk_mq_run_hw_queue
    a.                                blk_mq_stop_hw_queue
    b. rq added to ctx->dispatch

    later:
    1. blk_mq_start_hw_queue
    2. __blk_mq_run_hw_queue

    But it's possible that steps 1-2 run between a and b. Since the rq
    isn't in ctx->dispatch yet, step 2 will not run the rq. The rq may
    get lost if no subsequent requests kick in.
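
    One way to close the window, in sketch form (illustrative only, not
    necessarily the verbatim patch): after queuing the request on the
    dispatch list, re-check the stopped state and kick the queue again.

    spin_lock(&hctx->lock);
    list_add(&rq->queuelist, &hctx->dispatch);
    spin_unlock(&hctx->lock);

    /* the queue may have been restarted between a and b */
    if (!test_bit(BLK_MQ_S_STOPPED, &hctx->state))
        blk_mq_run_hw_queue(hctx, true);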

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

28 Apr, 2015

1 commit

  • Because of the peculiar way that md devices are created (automatically
    when the device node is opened), a new device can be created and
    registered immediately after the
    blk_unregister_region(disk_devt(disk), disk->minors);
    call in del_gendisk().

    Therefore it is important that all visible artifacts of the previous
    device are removed before this call. In particular, the 'bdi'.

    Since commit c4db59d31e39ea067c32163ac961e9c80198fd37 ("fs: don't
    reassign dirty inodes to default_backing_dev_info") by Christoph
    Hellwig moved the
    device_unregister(bdi->dev);
    call from bdi_unregister() to bdi_destroy(), it has been quite easy to
    lose a race and have a new (e.g.) "md127" be created after the
    blk_unregister_region() call and before bdi_destroy() is ultimately
    called by the final 'put_disk', which must come after del_gendisk().

    The new device finds that the bdi name is already registered in sysfs
    and complains

    > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
    > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'

    We can fix this by moving the bdi_destroy() call out of
    blk_release_queue() (which can happen very late when a refcount
    reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
    device driver calls it.

    Then it is only necessary for md to call blk_cleanup_queue() before
    del_gendisk(). As loop.c devices are also created on demand by
    opening the device node, we make the same change there.
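
    In sketch form, the move looks like this (illustrative):

    void blk_cleanup_queue(struct request_queue *q)
    {
        /* ... drain the queue and mark it dying ... */

        /* was done in blk_release_queue(), which can run much later
         * when the last reference is dropped */
        bdi_destroy(&q->backing_dev_info);

        blk_put_queue(q);
    }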

    Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37
    Reported-by: Azat Khuzhin
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org (v4.0)
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

27 Apr, 2015

1 commit

  • Commit d2c5e30c9a1420902262aa923794d2ae4e0bc391
    ("[PATCH] zoned vm counters: conversion of nr_bounce to per zone counter")
    converted the nr_bounce statistic to per-zone counters plus one global
    value in vm_stat, but it calls inc_/dec_zone_page_state() on different
    pages (and hence different zones), causing us to see unexpected values
    of NR_BOUNCE.

    Below is the result on my machine:
    Mar 2 09:26:08 udknight kernel: [144766.778265] Mem-Info:
    Mar 2 09:26:08 udknight kernel: [144766.778266] DMA per-cpu:
    Mar 2 09:26:08 udknight kernel: [144766.778268] CPU 0: hi: 0, btch: 1 usd: 0
    Mar 2 09:26:08 udknight kernel: [144766.778269] CPU 1: hi: 0, btch: 1 usd: 0
    Mar 2 09:26:08 udknight kernel: [144766.778270] Normal per-cpu:
    Mar 2 09:26:08 udknight kernel: [144766.778271] CPU 0: hi: 186, btch: 31 usd: 0
    Mar 2 09:26:08 udknight kernel: [144766.778273] CPU 1: hi: 186, btch: 31 usd: 0
    Mar 2 09:26:08 udknight kernel: [144766.778274] HighMem per-cpu:
    Mar 2 09:26:08 udknight kernel: [144766.778275] CPU 0: hi: 186, btch: 31 usd: 0
    Mar 2 09:26:08 udknight kernel: [144766.778276] CPU 1: hi: 186, btch: 31 usd: 0
    Mar 2 09:26:08 udknight kernel: [144766.778279] active_anon:46926 inactive_anon:287406 isolated_anon:0
    Mar 2 09:26:08 udknight kernel: [144766.778279] active_file:105085 inactive_file:139432 isolated_file:0
    Mar 2 09:26:08 udknight kernel: [144766.778279] unevictable:653 dirty:0 writeback:0 unstable:0
    Mar 2 09:26:08 udknight kernel: [144766.778279] free:178957 slab_reclaimable:6419 slab_unreclaimable:9966
    Mar 2 09:26:08 udknight kernel: [144766.778279] mapped:4426 shmem:305277 pagetables:784 bounce:0
    Mar 2 09:26:08 udknight kernel: [144766.778279] free_cma:0
    Mar 2 09:26:08 udknight kernel: [144766.778286] DMA free:3324kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    Mar 2 09:26:08 udknight kernel: [144766.778287] lowmem_reserve[]: 0 822 3754 3754
    Mar 2 09:26:08 udknight kernel: [144766.778293] Normal free:26828kB min:3632kB low:4540kB high:5448kB active_anon:4872kB inactive_anon:68kB active_file:1796kB inactive_file:1796kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:892920kB managed:842560kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:4144kB slab_reclaimable:25676kB slab_unreclaimable:39864kB kernel_stack:1944kB pagetables:3136kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2412612 all_unreclaimable? yes
    Mar 2 09:26:08 udknight kernel: [144766.778294] lowmem_reserve[]: 0 0 23451 23451
    Mar 2 09:26:08 udknight kernel: [144766.778299] HighMem free:685676kB min:512kB low:3748kB high:6984kB active_anon:182832kB inactive_anon:1149556kB active_file:418544kB inactive_file:555932kB unevictable:2612kB isolated(anon):0kB isolated(file):0kB present:3001732kB managed:3001732kB mlocked:0kB dirty:0kB writeback:0kB mapped:17704kB shmem:1216964kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:75771152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    Mar 2 09:26:08 udknight kernel: [144766.778300] lowmem_reserve[]: 0 0 0 0

    You can see bounce:75771152kB for HighMem, but bounce:0 for lowmem
    and the global counter.

    This patch fixes it.

    Signed-off-by: Wang YanQing
    Signed-off-by: Jens Axboe

    Wang YanQing
     

24 Apr, 2015

3 commits

  • The issue is described by the call path below:
    ->elevator_init
     ->elevator_init_fn
      ->{cfq,deadline,noop}_init_queue
       ->elevator_alloc
        ->kzalloc_node

    If kzalloc_node() fails, the module reference is put in
    elevator_alloc(); when elevator_init_fn() then fails as a result, the
    module reference is put again in elevator_init().

    Remove the elevator_put() invocation from the error path of
    elevator_alloc() to avoid this double-release issue.
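
    In sketch form (illustrative):

    struct elevator_queue *elevator_alloc(struct request_queue *q,
                                          struct elevator_type *e)
    {
        struct elevator_queue *eq;

        eq = kzalloc_node(sizeof(*eq), GFP_KERNEL, q->node);
        if (unlikely(!eq))
            return NULL;    /* no elevator_put(e) here; the caller puts it */

        /* ... initialize eq ... */
        return eq;
    }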

    Signed-off-by: Chao Yu
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Chao Yu
     
  • hctx->tags has to be set to NULL in blk_mq_map_swqueue() whenever the
    hw queue is to be unmapped, no matter whether set->tags[hctx->queue_num]
    is NULL or not, because shared tags may already have been freed by
    another request queue.

    The same situation has to be considered when handling CPU online as
    well. An unmapped hw queue can be remapped after the CPU topology has
    changed, so we need to allocate tags for the hw queue in
    blk_mq_map_swqueue(). Tag allocation for the hw queue can then be
    removed from the hctx CPU online notifier, and it is reasonable to do
    that after the mapping is updated.
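
    A sketch of the resulting blk_mq_map_swqueue() handling, consistent
    with the description (not necessarily the verbatim patch):

    queue_for_each_hw_ctx(q, hctx, i) {
        if (!hctx->nr_ctx) {
            /* unmapped: shared tags may already be freed elsewhere */
            if (set->tags[i])
                blk_mq_free_rq_map(set, set->tags[i], i);
            set->tags[i] = NULL;
            hctx->tags = NULL;
            continue;
        }
        /* remapped after a CPU topology change: allocate tags here */
        if (!set->tags[i])
            set->tags[i] = blk_mq_init_rq_map(set, i);
        hctx->tags = set->tags[i];
    }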

    Cc:
    Reported-by: Dongsu Park
    Tested-by: Dongsu Park
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • First, during CPU hotplug, even when the queue is frozen, the timeout
    handler may still run and access hctx->tags, which can cause a
    use-after-free; so this patch deactivates the timeout handler inside
    the CPU hotplug notifier.

    Second, tags can be shared by more than one queue, so we have to
    check whether the hctx has been unmapped; otherwise a use-after-free
    on the tags can still be triggered.
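
    In sketch form (both fragments are illustrative of the described
    behavior, not the verbatim patch):

    /* in the hotplug notifier: quiesce the timeout handler */
    list_for_each_entry(q, &all_q_list, all_q_node)
        del_timer_sync(&q->timeout);

    /* in the per-hctx timeout scan: skip unmapped hw queues */
    if (!blk_mq_hw_queue_mapped(hctx))
        continue;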

    Cc:
    Reported-by: Dongsu Park
    Tested-by: Dongsu Park
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

17 Apr, 2015

2 commits

  • Commit 889fa31f00b2 was a bit too eager in reducing the loop count,
    so we ended up missing queues in some configurations. Ensure that
    our division rounds up, so that's not the case.
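
    In sketch form, the fix is a round-up division when sizing the map:

    /* was nr_ctx / bits_per_word, which could undercount */
    hctx->ctx_map.size = DIV_ROUND_UP(hctx->nr_ctx,
                                      hctx->ctx_map.bits_per_word);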

    Reported-by: Guenter Roeck
    Fixes: 889fa31f00b2 ("blk-mq: reduce unnecessary software queue looping")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pull block layer core bits from Jens Axboe:
    "This is the core pull request for 4.1. Not a lot of stuff in here for
    this round, mostly little fixes or optimizations. This pull request
    contains:

    - An optimization that speeds up queue runs on blk-mq, especially for
    the case where there's a large difference between nr_cpu_ids and
    the actual mapped software queues on a hardware queue. From Chong
    Yuan.

    - Honor node local allocations for requests on legacy devices. From
    David Rientjes.

    - Cleanup of blk_mq_rq_to_pdu() from me.

    - exit_aio() fixup from me, greatly speeding up exiting multiple IO
    contexts off exit_group(). For my particular test case, fio exit
    took ~6 seconds. A typical case of both exposing RCU grace periods
    to user space, and serializing exit of them.

    - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
    wait if __GFP_WAIT is set. From Keith Busch.

    - blk-mq exports and two added helpers from Mike Snitzer, which will
    be used by the dm-mq code.

    - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"

    * 'for-4.1/core' of git://git.kernel.dk/linux-block:
    blk-mq: reduce unnecessary software queue looping
    aio: fix serial draining in exit_aio()
    blk-mq: cleanup blk_mq_rq_to_pdu()
    blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
    block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
    block: allocate request memory local to request queue
    blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
    blk-mq: export blk_mq_run_hw_queues
    blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk

    Linus Torvalds
     

16 Apr, 2015

1 commit

  • In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how
    many ctxs are assigned to one hctx, they always loop
    hctx->ctx_map.map_size times, where hctx->ctx_map.map_size is a
    constant ALIGN(nr_cpu_ids, 8) / 8. flush_busy_ctxs() in particular is
    on a hot code path, so this is unnecessary. Change ->map_size to hold
    the number of actually mapped software queues, so we only loop for as
    many iterations as we have to.

    Also remove the cpumask setting and nr_ctx count in
    blk_mq_init_cpu_queues(), since they are all redone in
    blk_mq_map_swqueue().

    Signed-off-by: Chong Yuan
    Reviewed-by: Wenbo Wang

    Updated by me for formatting and commenting.

    Signed-off-by: Jens Axboe

    Chong Yuan
     

15 Apr, 2015

1 commit

  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds
     

12 Apr, 2015

3 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • ... and don't skip access_ok() validation.

    Signed-off-by: Al Viro

    Al Viro
     
  • Jan Engelhardt reports a strange oops with an invalid ->sense_buffer
    pointer in scsi_init_cmd_errh() with the blk-mq code.

    The sense_buffer pointer should have been initialized by the call to
    scsi_init_request() from blk_mq_init_rq_map(), but there seems to be
    some non-repeatable memory corruptor.

    This patch makes sure we initialize the whole struct request
    allocation (and the associated 'struct scsi_cmnd' for the SCSI case)
    to zero, by using __GFP_ZERO in the allocation. The old code
    initialized a couple of individual fields, leaving the rest undefined
    (although many of them are then initialized in later phases, like
    blk_mq_rq_ctx_init() etc).

    It's not entirely clear why this matters, but it's the right thing to
    do regardless, and with 4.0 imminent this is the defensive "let's
    just make sure everything is initialized properly" patch.
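
    In sketch form (illustrative of the described change):

    /* zero the whole rq map allocation, including the driver pdu */
    page = alloc_pages_node(set->numa_node,
                            GFP_KERNEL | __GFP_WAIT | __GFP_ZERO,
                            this_order);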

    Tested-by: Jan Engelhardt
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Mar, 2015

1 commit

  • Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
    caused blk_stack_limits() to not properly stack queue_limits for stacked
    devices (e.g. DM).

    Fix this regression by establishing lcm_not_zero() and switching
    blk_stack_limits() over to using it.

    DM uses blk_set_stacking_limits() to establish the initial top-level
    queue_limits that are then built up based on underlying devices' limits
    using blk_stack_limits(). In the case of optimal_io_size (io_opt)
    blk_set_stacking_limits() establishes a default value of 0. With commit
    69c953c, lcm(0, n) is no longer n, which compromises proper stacking of
    the underlying devices' io_opt.
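
    A minimal sketch of lcm_not_zero() consistent with those semantics,
    where zero means "no constraint" rather than "LCM is zero":

    static inline unsigned long lcm_not_zero(unsigned long a, unsigned long b)
    {
        if (!a)
            return b;
        if (!b)
            return a;
        return lcm(a, b);
    }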

    Test:
    $ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
    $ cat /sys/block/sde/queue/optimal_io_size
    786432
    $ dmsetup create node --table "0 100 linear /dev/sde 0"

    Before this fix:
    $ cat /sys/block/dm-5/queue/optimal_io_size
    0

    After this fix:
    $ cat /sys/block/dm-5/queue/optimal_io_size
    786432

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # 3.19+
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

30 Mar, 2015

2 commits


25 Mar, 2015

1 commit

  • blk_init_rl() allocates a mempool using mempool_create_node() with
    node-local memory. This only allocates the mempool and element list
    local to the request queue's node.

    What we really want to do is allocate the request itself local to the
    queue. To do this, we need our own alloc and free functions that will
    allocate from request_cachep and pass the request queue node in to prefer
    node local memory.
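
    A sketch of such allocators (the names alloc_request_simple and
    free_request_simple are illustrative; request_cachep is the existing
    request slab):

    static void *alloc_request_simple(gfp_t gfp_mask, void *data)
    {
        struct request_queue *q = data;

        /* prefer memory local to the queue's node, not the caller's */
        return kmem_cache_alloc_node(request_cachep, gfp_mask, q->node);
    }

    static void free_request_simple(void *element, void *data)
    {
        kmem_cache_free(request_cachep, element);
    }

    /* wired up as:
     * mempool_create(BLKDEV_MIN_RQ, alloc_request_simple,
     *                free_request_simple, q)
     */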

    Acked-by: Tejun Heo
    Signed-off-by: David Rientjes
    Signed-off-by: Jens Axboe

    David Rientjes
     

20 Mar, 2015

1 commit

  • Use the right array index to reference the last
    element of rq->biotail->bi_io_vec[]
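
    That is, in sketch form:

    /* the last element lives at index bi_vcnt - 1, not bi_vcnt */
    struct bio_vec *bv = &rq->biotail->bi_io_vec[rq->biotail->bi_vcnt - 1];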

    Signed-off-by: Wenbo Wang
    Reviewed-by: Chong Yuan
    Fixes: 66cb45aa41315 ("block: add support for limiting gaps in SG lists")
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Wenbo Wang
     

19 Mar, 2015

1 commit

  • When allocating from the reserved tags pool, bt_get() is called with
    a NULL hctx. If all tags are in use, the hw queue is kicked to push
    out any pending IO, potentially freeing tags, and tag allocation is
    retried. The problem is that blk_mq_run_hw_queue() doesn't check for
    a NULL hctx. So we avoid it with a simple NULL hctx test.
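
    The guard, in sketch form:

    /* reserved-tag allocation passes hctx == NULL; don't kick then */
    if (hctx)
        blk_mq_run_hw_queue(hctx, false);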

    Tested by hammering mtip32xx with concurrent smartctl/hdparm.

    Signed-off-by: Sam Bradshaw
    Signed-off-by: Selvan Mani
    Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()")
    Cc: stable@kernel.org

    Added appropriate comment.

    Signed-off-by: Jens Axboe

    Sam Bradshaw
     

13 Mar, 2015

4 commits

  • Return -EBUSY if we're unable to enter a queue immediately when
    allocating a blk-mq request without __GFP_WAIT.
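
    In sketch form (illustrative of the described behavior):

    static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
    {
        while (true) {
            if (percpu_ref_tryget_live(&q->mq_usage_counter))
                return 0;

            if (!(gfp & __GFP_WAIT))
                return -EBUSY;

            /* otherwise block until the queue is unfrozen ... */
        }
    }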

    Signed-off-by: Keith Busch
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Rename blk_mq_run_queues to blk_mq_run_hw_queues, add async argument,
    and export it.

    DM's suspend support must be able to run the queue without starting
    stopped hw queues.
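
    In sketch form (simplified):

    void blk_mq_run_hw_queues(struct request_queue *q, bool async)
    {
        struct blk_mq_hw_ctx *hctx;
        unsigned int i;

        queue_for_each_hw_ctx(q, hctx, i) {
            /* honor stopped queues, as DM's suspend path requires */
            if (test_bit(BLK_MQ_S_STOPPED, &hctx->state))
                continue;
            blk_mq_run_hw_queue(hctx, async);
        }
    }
    EXPORT_SYMBOL(blk_mq_run_hw_queues);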

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • Add a variant of blk_mq_init_queue that allows a previously allocated
    queue to be initialized. blk_mq_init_allocated_queue models
    blk_init_allocated_queue -- which was also created for DM's use.

    DM's approach to device creation requires a placeholder request_queue be
    allocated for use with alloc_dev() but the decision about what type of
    request_queue will be ultimately created is deferred until all component
    devices referenced in the DM table are processed to determine the table
    type (request-based, blk-mq request-based, or bio-based).

    Also, because of DM's late finalization of the request_queue type
    the call to blk_mq_register_disk() doesn't happen during alloc_dev().
    Must export blk_mq_register_disk() so that DM can backfill the 'mq' dir
    once the blk-mq queue is fully allocated.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • If percpu_ref_init() fails the allocated q and hctxs must get cleaned
    up; using 'err_map' doesn't allow that to happen.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

21 Feb, 2015

1 commit

  • When reading blkio.throttle.io_serviced in a recently created blkio
    cgroup, it's possible to race against the creation of a throttle policy,
    which delays the allocation of stats_cpu.

    Like other functions in the throttle code, just checking for a NULL
    stats_cpu prevents the following oops caused by that race.
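
    The guard, in sketch form (in tg_prfill_cpu_rwstat, per the trace
    below):

    /* racing with throttle policy creation: stats_cpu may not exist yet */
    if (tg->stats_cpu == NULL)
        return 0;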

    [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
    [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
    [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
    [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
    [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
    [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
    [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
    [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300 Not tainted (3.19.0)
    [ 1137.734230] MSR: 9000000000009032 CR: 42008884 XER: 20000000
    [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
    GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
    GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
    GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
    GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
    GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
    GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
    [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
    [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
    [ 1137.734943] Call Trace:
    [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
    [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
    [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
    [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
    [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
    [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
    [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
    [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
    [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
    [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
    [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
    [ 1137.735383] Instruction dump:
    [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
    [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 e9090008 e9490010 e9290018

    And here is a program that easily reproduces this, although the issue
    was first found by running docker.

    /* The original posting elided the headers and constants; the ones
     * below are assumed values for illustration. */
    #define _GNU_SOURCE                         /* for O_DIRECT */
    #include <fcntl.h>
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUFFER_ALIGN 512                    /* assumed */
    #define BUFFER_SIZE 4096                    /* assumed */
    #define CGPATH "/sys/fs/cgroup/blkio"       /* assumed */
    #define NR_TESTS 1000                       /* assumed */

    void run(pid_t pid)
    {
        int n;
        int status;
        int fd;
        char *buffer;

        buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
        n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);

        /* move the target pid into the test cgroup */
        fd = open(CGPATH "/test/tasks", O_WRONLY);
        write(fd, buffer, n);
        close(fd);

        if (fork() > 0) {
            /* generate direct IO so throttle stats get updated */
            fd = open("/dev/sda", O_RDONLY | O_DIRECT);
            read(fd, buffer, 512);
            close(fd);
            wait(&status);
        } else {
            /* race: read io_serviced while the policy may still be
             * initializing */
            fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
            n = read(fd, buffer, BUFFER_SIZE);
            close(fd);
        }
        free(buffer);
        exit(0);
    }

    void test(void)
    {
        int status;

        mkdir(CGPATH "/test", 0666);
        if (fork() > 0)
            wait(&status);
        else
            run(getpid());
        rmdir(CGPATH "/test");
    }

    int main(int argc, char **argv)
    {
        int i;

        for (i = 0; i < NR_TESTS; i++)
            test();
        return 0;
    }

    Reported-by: Ricardo Marin Matinata
    Signed-off-by: Thadeu Lima de Souza Cascardo
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Thadeu Lima de Souza Cascardo
     

13 Feb, 2015

3 commits

  • Pull block driver changes from Jens Axboe:
    "This contains:

    - The 4k/partition fixes for brd from Boaz/Matthew.

    - A few xen front/back block fixes from David Vrabel and Roger Pau
    Monne.

    - Floppy changes from Takashi, cleaning the device file creation.

    - Switching libata to use the new blk-mq tagging policy, removing
    code (and a suboptimal implementation) from libata. This will
    throw you a merge conflict, since a bug in the original libata
    tagging code was fixed since this code was branched. Trivial.
    From Shaohua.

    - Conversion of loop to blk-mq, from Ming Lei.

    - Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
    He claims it improves on unreadable code, which will cost him a
    beer.

    - Maintainer update for NBD, now handled by Markus Pargmann.

    - NVMe:
    - Optimization from me that avoids a kmalloc/kfree per IO for
    smaller IO.

    [...]

    block: loop: don't handle REQ_FUA explicitly
    block: loop: introduce lo_discard() and lo_req_flush()
    block: loop: say goodby to bio
    block: loop: improve performance via blk-mq

    Linus Torvalds
     
  • Pull core block IO changes from Jens Axboe:
    "This contains:

    - A series from Christoph that cleans up and refactors various parts
    of the REQ_BLOCK_PC handling. Contributions in that series from
    Dongsu Park and Kent Overstreet as well.

    - CFQ:
    - A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
    - A stable patch fixing a potential crash in CFQ in OOM
    situations. From Konstantin Khlebnikov.

    - blk-mq:
    - Add support for tag allocation policies, from Shaohua. This is
    a prep patch enabling libata (and other SCSI parts) to use the
    blk-mq tagging, instead of rolling their own.
    - Various little tweaks from Keith and Mike, in preparation for
    DM blk-mq support.
    - Minor little fixes or tweaks from me.
    - A double free error fix from Tony Battersby.

    - The partition 4k issue fixes from Matthew and Boaz.

    - Add support for zero+unprovision for blkdev_issue_zeroout() from
    Martin"

    * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
    block: remove unused function blk_bio_map_sg
    block: handle the null_mapped flag correctly in blk_rq_map_user_iov
    blk-mq: fix double-free in error path
    block: prevent request-to-request merging with gaps if not allowed
    blk-mq: make blk_mq_run_queues() static
    dm: fix multipath regression due to initializing wrong request
    cfq-iosched: handle failure of cfq group allocation
    block: Quiesce zeroout wrapper
    block: rewrite and split __bio_copy_iov()
    block: merge __bio_map_user_iov into bio_map_user_iov
    block: merge __bio_map_kern into bio_map_kern
    block: pass iov_iter to the BLOCK_PC mapping functions
    block: add a helper to free bio bounce buffer pages
    block: use blk_rq_map_user_iov to implement blk_rq_map_user
    block: simplify bio_map_kern
    block: mark blk-mq devices as stackable
    block: keep established cmd_flags when cloning into a blk-mq request
    block: add blk-mq support to blk_insert_cloned_request()
    block: require blk_rq_prep_clone() be given an initialized clone request
    blk-mq: add tag allocation policy
    ...

    Linus Torvalds
     
  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the life time rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

12 Feb, 2015

3 commits