20 Jun, 2014

1 commit

  • Pull block fixes from Jens Axboe:
    "A smaller collection of fixes for the block core that would be nice to
    have in -rc2. This pull request contains:

    - Fixes for races in the wait/wakeup logic used in blk-mq from
    Alexander. No issues have been observed, but it is definitely a
    bit flakey currently. Alternatively, we may drop the cyclic
    wakeups going forward, but that needs more testing.

    - Some cleanups from Christoph.

    - Fix for an oops in null_blk if queue_mode=1 and softirq completions
    are used. From me.

    - A fix for a regression caused by the chunk size setting. It
    inadvertently used max_hw_sectors instead of max_sectors, which is
    incorrect, and causes hangs on btrfs multi-disk setups (where hw
    sectors apparently isn't set). From me.

    - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a
    recent addition as well, but it actually breaks blk-mq which relies
    on strict scheduling. If the workqueue power_efficient mode is
    turned on, this breaks blk-mq. From Matias.

    - null_blk module parameter description fix from Mike"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: bitmap tag: fix races in bt_get() function
    blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
    blk-mq: bitmap tag: fix races on shared ::wake_index fields
    block: blk_max_size_offset() should check ->max_sectors
    null_blk: fix softirq completions for queue_mode == 1
    blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
    blk-mq: properly drain stopped queues
    block: remove WQ_POWER_EFFICIENT from kblockd
    null_blk: fix name and description of 'queue_mode' module parameter
    block: remove elv_abort_queue and blk_abort_flushes

    Linus Torvalds
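
The chunk size regression described in the pull message above can be illustrated with a small user-space sketch. This is not the kernel's code: the struct and function names are hypothetical stand-ins, and the sketch assumes chunk_sectors is either zero or a power of two.

```c
#include <assert.h>

/* Hypothetical stand-in for the few queue limits this sketch needs. */
struct limits_sketch {
	unsigned int max_hw_sectors; /* hardware ceiling; some setups leave it unset */
	unsigned int max_sectors;    /* limit actually applied to requests */
	unsigned int chunk_sectors;  /* 0 means no chunk boundary; else a power of two */
};

/* Largest I/O, in sectors, that may start at 'offset' without crossing a chunk. */
static unsigned int max_size_at_offset(const struct limits_sketch *lim,
				       unsigned long long offset)
{
	/* The mask below only works for zero or a power of two. */
	assert((lim->chunk_sectors & (lim->chunk_sectors - 1)) == 0);

	if (!lim->chunk_sectors)
		return lim->max_sectors; /* the fix: not max_hw_sectors */

	return lim->chunk_sectors -
	       (unsigned int)(offset & (lim->chunk_sectors - 1));
}
```

With no chunk size configured, the result is max_sectors even when max_hw_sectors is zero, which is the situation the pull message reports for btrfs multi-disk setups.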
     

16 Jun, 2014

1 commit

  • Pull NVMe update from Matthew Wilcox:
    "Mostly bugfixes again for the NVMe driver. I'd like to call out the
    exported tracepoint in the block layer; I believe Keith has cleared
    this with Jens.

    We've had a few reports from people who're really pounding on NVMe
    devices at scale, hence the timeout changes (and new module
    parameters), hotplug cpu deadlock, tracepoints, and minor performance
    tweaks"

    [ Jens hadn't seen that tracepoint thing, but is ok with it - it will
    end up going away when mq conversion happens ]

    * git://git.infradead.org/users/willy/linux-nvme: (22 commits)
    NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
    NVMe: Use Log Page constants in SCSI emulation
    NVMe: Define Log Page constants
    NVMe: Fix hot cpu notification dead lock
    NVMe: Rename io_timeout to nvme_io_timeout
    NVMe: Use last bytes of f/w rev SCSI Inquiry
    NVMe: Adhere to request queue block accounting enable/disable
    NVMe: Fix nvme get/put queue semantics
    NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
    NVMe: Make admin timeout a module parameter
    NVMe: Make iod bio timeout a parameter
    NVMe: Prevent possible NULL pointer dereference
    NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
    NVMe: Update data structures for NVMe 1.2
    NVMe: Enable BUILD_BUG_ON checks
    NVMe: Update namespace and controller identify structures to the 1.1a spec
    NVMe: Flush with data support
    NVMe: Configure support for block flush
    NVMe: Add tracepoints
    NVMe: Protect against badly formatted CQEs
    ...

    Linus Torvalds
     

12 Jun, 2014

1 commit

  • blk-mq issues async requests through kblockd. To issue a work request on
    a specific CPU, kblockd_schedule_delayed_work_on is used. However, the
    specific CPU choice may not be honored, if the power_efficient option
    for workqueues is set. blk-mq requires that we have strict per-cpu
    scheduling, so it won't work properly if kblockd is marked
    POWER_EFFICIENT and power_efficient is set.

    Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior.
    This essentially reverts part of commit 695588f9454b, which added
    the WQ_POWER_EFFICIENT marker to kblockd.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
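
A rough user-space sketch of the flag interaction described above. It assumes that a workqueue marked power efficient is treated as unbound once the power_efficient option is enabled, so the requested CPU is no longer guaranteed. The flag names mirror the kernel's but the values and helper functions are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; names mirror the kernel's, values are assumptions. */
enum {
	SKETCH_WQ_UNBOUND         = 1u << 1,
	SKETCH_WQ_POWER_EFFICIENT = 1u << 7,
};

/* Flags a workqueue effectively runs with, given the power_efficient option. */
static unsigned int effective_wq_flags(unsigned int flags, bool power_efficient)
{
	/* This sketch models only the two flags above. */
	assert((flags & ~(SKETCH_WQ_UNBOUND | SKETCH_WQ_POWER_EFFICIENT)) == 0);

	if ((flags & SKETCH_WQ_POWER_EFFICIENT) && power_efficient)
		flags |= SKETCH_WQ_UNBOUND;
	return flags;
}

/* True when work queued "on CPU n" is guaranteed to run on CPU n. */
static bool honors_requested_cpu(unsigned int flags, bool power_efficient)
{
	return !(effective_wq_flags(flags, power_efficient) & SKETCH_WQ_UNBOUND);
}
```

Dropping the power-efficient marker from kblockd corresponds to passing flags of 0 here, which keeps the per-CPU guarantee regardless of the option.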
     

06 Jun, 2014

1 commit

  • With the optimizations around not clearing the full request at alloc
    time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
    up to the user allocating the request.

    Add a blk_rq_set_block_pc() that sets the command type to
    REQ_TYPE_BLOCK_PC, and properly initializes the members associated
    with this type of request. Update callers to use this function instead
    of manipulating rq->cmd_type directly.

    Includes fixes from Christoph Hellwig for my half-assed
    attempt.

    Signed-off-by: Jens Axboe

    Jens Axboe
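
The idea of a single helper that sets the command type and initializes the associated members can be sketched as follows. The struct is a hypothetical, heavily reduced stand-in for struct request, and which members the real blk_rq_set_block_pc() touches is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_CDB 16

enum sketch_cmd_type { SKETCH_TYPE_FS = 1, SKETCH_TYPE_BLOCK_PC };

/* Hypothetical, reduced stand-in for struct request. */
struct sketch_request {
	enum sketch_cmd_type cmd_type;
	unsigned int data_len;
	unsigned long long sector;
	void *bio;
	unsigned char cmd_buf[SKETCH_MAX_CDB];
	unsigned char *cmd;
};

/* Set the type and every member a passthrough command relies on. */
static void sketch_set_block_pc(struct sketch_request *rq)
{
	assert(rq != NULL);
	rq->cmd_type = SKETCH_TYPE_BLOCK_PC;
	rq->data_len = 0;
	rq->sector = (unsigned long long)-1; /* no meaningful sector */
	rq->bio = NULL;
	memset(rq->cmd_buf, 0, sizeof(rq->cmd_buf));
	rq->cmd = rq->cmd_buf;
}
```

Because the request is no longer fully cleared at allocation time, callers that only assigned the type would otherwise inherit stale values in the other members.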
     

29 May, 2014

1 commit


28 May, 2014

1 commit


27 May, 2014

1 commit


21 May, 2014

2 commits

  • In blk_mq_make_request(), do the blk_queue_nomerges() check
    outside the call to blk_attempt_plug_merge() to eliminate
    function call overhead when nomerges=2 (disabled)

    Signed-off-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Robert Elliott
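
A minimal sketch of checking the nomerges setting before calling into the plug merge path. All names are hypothetical, and the counter exists only to show that the merge function is never entered when nomerges is 2.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the (stand-in) plug merge function is entered. */
static unsigned int sketch_plug_merge_calls;

static bool sketch_attempt_plug_merge(void)
{
	sketch_plug_merge_calls++;
	return false; /* pretend nothing in the plug list was mergeable */
}

/* Returns true if the request was merged; nomerges is the sysfs value 0..2. */
static bool sketch_submit(unsigned int nomerges)
{
	assert(nomerges <= 2);

	/* Test the flag first so the call is skipped entirely when disabled. */
	if (nomerges != 2 && sketch_attempt_plug_merge())
		return true;
	return false;
}
```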
     
  • For request_fn based devices, the block layer exports a 'nr_requests'
    file through sysfs to allow adjusting of queue depth on the fly.
    Currently this returns -EINVAL for blk-mq, since it's not wired up.
    Wire this up for blk-mq, so that it now also allows dynamic
    adjustments of the allowed queue depth for any given block device
    managed by blk-mq.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 May, 2014

1 commit

  • We first check if we have inflight IO, then retrieve that
    same number again. Usually this isn't that costly since the
    chance of having the data dirtied in between is small, but
    there's no reason for calling part_in_flight() twice.

    Signed-off-by: Jens Axboe

    Jens Axboe
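
The change amounts to reading the in-flight count once and reusing it. A user-space sketch under assumed names (the struct layout and field names are illustrative, and counter wraparound is ignored):

```c
#include <assert.h>

/* Hypothetical stand-in for per-partition statistics. */
struct sketch_part {
	unsigned int in_flight[2];   /* reads, writes */
	unsigned long time_in_queue;
	unsigned long io_ticks;
	unsigned long stamp;         /* time of the last update */
};

static unsigned int sketch_in_flight(const struct sketch_part *p)
{
	return p->in_flight[0] + p->in_flight[1];
}

static void sketch_round_stats(struct sketch_part *p, unsigned long now)
{
	unsigned int inflight;

	assert(now >= p->stamp); /* the sketch ignores wraparound */
	if (now == p->stamp)
		return;

	inflight = sketch_in_flight(p); /* read once, use for check and math */
	if (inflight) {
		p->time_in_queue += inflight * (now - p->stamp);
		p->io_ticks += now - p->stamp;
	}
	p->stamp = now;
}
```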
     

05 May, 2014

1 commit

  • Adding tracepoints for bio_complete and block_split into nvme to help
    with gathering IO info using blktrace and blkparse.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     

17 Apr, 2014

2 commits


16 Apr, 2014

2 commits

  • This was used in the olden days, back when onions were proper
    yellow. Basically it mapped to the current buffer to be
    transferred. With highmem being added more than a decade ago,
    most drivers map pages out of a bio, and rq->buffer isn't
    pointing at anything valid.

    Convert old style drivers to just use bio_data().

    For the discard payload use case, just reference the page
    in the bio.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't like this, but things have diverged with the blk-mq fixes
    in 3.15-rc1. So merge it in.

    Jens Axboe
     

11 Apr, 2014

1 commit


10 Apr, 2014

3 commits

  • Martin reported that his test system would not boot with
    current git; it oopsed with this:

    BUG: unable to handle kernel paging request at ffff88046c6c9e80
    IP: [] blk_queue_start_tag+0x90/0x150
    PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
    Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
    rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
    CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
    Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
    Workqueue: events_unbound async_run_entry_fn
    task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
    RIP: 0010:[] []
    blk_queue_start_tag+0x90/0x150
    RSP: 0018:ffff880273d03a58 EFLAGS: 00010092
    RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
    RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
    RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
    R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
    R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
    FS: 0000000000000000(0000) GS:ffff880277b00000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
    Stack:
    ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
    ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
    ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
    Call Trace:
    [] scsi_request_fn+0xf2/0x510
    [] __blk_run_queue+0x37/0x50
    [] blk_execute_rq_nowait+0xb3/0x130
    [] blk_execute_rq+0x64/0xf0
    [] ? bit_waitqueue+0xd0/0xd0
    [] scsi_execute+0xe5/0x180
    [] scsi_execute_req_flags+0x9a/0x110
    [] sd_spinup_disk+0x94/0x460 [sd_mod]
    [] ? __unmap_hugepage_range+0x200/0x2f0
    [] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
    [] sd_probe_async+0xd8/0x200 [sd_mod]
    [] async_run_entry_fn+0x3f/0x140
    [] process_one_work+0x175/0x410
    [] worker_thread+0x123/0x400
    [] ? manage_workers+0x160/0x160
    [] kthread+0xce/0xf0
    [] ? kthread_freezable_should_stop+0x70/0x70
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_freezable_should_stop+0x70/0x70
    Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
    df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 89
    58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
    RIP [] blk_queue_start_tag+0x90/0x150
    RSP
    CR2: ffff88046c6c9e80

    Martin bisected and found this to be the problem patch:

    commit 6d113398dcf4dfcd9787a4ead738b186f7b7ff0f
    Author: Jan Kara
    Date: Mon Feb 24 16:39:54 2014 +0100

    block: Stop abusing rq->csd.list in blk-softirq

    and the problem was immediately apparent. The patch states that
    it is safe to reuse queuelist at completion time, since it is
    no longer used. However, that is not true if a device is using
    block enabled tagging. If that is the case, then the queuelist
    is reused to keep track of busy tags. If a device also ended
    up using softirq completions, we'd reuse ->queuelist for the
    IPI handling while block tagging was still using it. Boom.

    Fix this by adding a new ipi_list list head, and share the
    memory used with the request hash table. The hash table is
    never used after the request is moved to the dispatch list,
    which happens long before any potential completion of the
    request. Add a new request bit for this, so we don't have
    cases that check rq->hash while it could potentially have
    been reused for the IPI completion.

    Reported-by: Martin K. Petersen
    Tested-by: Benjamin Herrenschmidt
    Signed-off-by: Jens Axboe

    Jens Axboe
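
The fix can be pictured as a union plus a flag bit. The following is a user-space sketch with hypothetical type and flag names, not the kernel's struct request: the hash node and the IPI list head share storage, queuelist is left alone for tagging, and a flag records whether the hash member is still live.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_list_head  { struct sketch_list_head *next, *prev; };
struct sketch_hlist_node { struct sketch_hlist_node *next, **pprev; };

#define SKETCH_REQ_HASHED (1u << 0) /* hypothetical flag bit */

struct sketch_request {
	struct sketch_list_head queuelist; /* may still track a busy tag */
	unsigned int flags;
	union {
		struct sketch_hlist_node hash;    /* merge lookup, before dispatch */
		struct sketch_list_head ipi_list; /* completion, after dispatch */
	};
};

static bool sketch_rq_hashed(const struct sketch_request *rq)
{
	return rq->flags & SKETCH_REQ_HASHED;
}

/* Moving to the dispatch list ends the hash member's lifetime. */
static void sketch_dispatch(struct sketch_request *rq)
{
	rq->flags &= ~SKETCH_REQ_HASHED;
}

/* Only once unhashed may the shared storage be reused for completion. */
static struct sketch_list_head *sketch_ipi_node(struct sketch_request *rq)
{
	assert(!sketch_rq_hashed(rq));
	rq->ipi_list.next = rq->ipi_list.prev = &rq->ipi_list;
	return &rq->ipi_list;
}
```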
     
  • Same function as kblockd_schedule_delayed_work(), but allow the
    caller to pass in a CPU that the work should be executed on. This
    just directly extends and maps into the workqueue API, and will
    be used to make the blk-mq mappings more strict.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The queue parameter is never used, just get rid of it.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Apr, 2014

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual rocket science -- mostly documentation and comment updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    sparse: fix comment
    doc: fix double words
    isdn: capi: fix "CAPI_VERSION" comment
    doc: DocBook: Fix typos in xml and template file
    Bluetooth: add module name for btwilink
    driver core: unexport static function create_syslog_header
    mmc: core: typo fix in printk specifier
    ARM: spear: clean up editing mistake
    net-sysfs: fix comment typo 'CONFIG_SYFS'
    doc: Insert MODULE_ in module-signing macros
    Documentation: update URL to hfsplus Technote 1150
    gpio: update path to documentation
    ixgbe: Fix format string in ixgbe_fcoe.
    Kconfig: Remove useless "default N" lines
    user_namespace.c: Remove duplicated word in comment
    CREDITS: fix formatting
    treewide: Fix typo in Documentation/DocBook
    mm: Fix warning on make htmldocs caused by slab.c
    ata: ata-samsung_cf: cleanup in header file
    idr: remove unused prototype of idr_free()

    Linus Torvalds
     

02 Apr, 2014

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the pull request for the core block IO bits for the 3.15
    kernel. It's a smaller round this time, it contains:

    - Various little blk-mq fixes and additions from Christoph and
    myself.

    - Cleanup of the IPI usage from the block layer, and associated
    helper code. From Frederic Weisbecker and Jan Kara.

    - Duplicate code cleanup in bio-integrity from Gu Zheng. This will
    give you a merge conflict, but that should be easy to resolve.

    - blk-mq notify spinlock fix for RT from Mike Galbraith.

    - A blktrace partial accounting bug fix from Roman Pen.

    - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"

    * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
    blk-mq: add REQ_SYNC early
    rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
    blk-mq: support partial I/O completions
    blk-mq: merge blk_mq_insert_request and blk_mq_run_request
    blk-mq: remove blk_mq_alloc_rq
    blk-mq: don't dump CPU -> hw queue map on driver load
    blk-mq: fix wrong usage of hctx->state vs hctx->flags
    blk-mq: allow blk_mq_init_commands() to return failure
    block: remove old blk_iopoll_enabled variable
    blktrace: fix accounting of partially completed requests
    smp: Rename __smp_call_function_single() to smp_call_function_single_async()
    smp: Remove wait argument from __smp_call_function_single()
    watchdog: Simplify a little the IPI call
    smp: Move __smp_call_function_single() below its safe version
    smp: Consolidate the various smp_call_function_single() declensions
    smp: Teach __smp_call_function_single() to check for offline cpus
    smp: Remove unused list_head from csd
    smp: Iterate functions through llist_for_each_entry_safe()
    block: Stop abusing rq->csd.list in blk-softirq
    block: Remove useless IPI struct initialization
    ...

    Linus Torvalds
     

21 Mar, 2014

1 commit


09 Mar, 2014

1 commit

  • Commit 1874198 ("blk-mq: rework flush sequencing logic") switched
    ->flush_rq from being an embedded member of the request_queue structure
    to being dynamically allocated in blk_init_queue_node().

    Request-based DM multipath doesn't use blk_init_queue_node(), instead it
    uses blk_alloc_queue_node() + blk_init_allocated_queue(). Because
    commit 1874198 placed the dynamic allocation of ->flush_rq in
    blk_init_queue_node() any flush issued to a dm-mpath device would crash
    with a NULL pointer, e.g.:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] blk_rq_init+0x1e/0xb0
    PGD bb3c7067 PUD bb01d067 PMD 0
    Oops: 0002 [#1] SMP
    ...
    CPU: 5 PID: 5028 Comm: dt Tainted: G W O 3.14.0-rc3.snitm+ #10
    ...
    task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000
    RIP: 0010:[] [] blk_rq_init+0x1e/0xb0
    RSP: 0018:ffff880079565c98 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030
    RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001
    R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000
    R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000
    FS: 00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0
    Stack:
    0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47
    ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000
    ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8
    Call Trace:
    [] blk_flush_complete_seq+0x2d7/0x2e0
    [] blk_insert_flush+0x1dd/0x210
    [] __elv_add_request+0x1f9/0x320
    [] ? blk_account_io_start+0x111/0x190
    [] blk_queue_bio+0x25b/0x330
    [] dm_request+0x35/0x40 [dm_mod]
    [] generic_make_request+0xc0/0x100
    [] submit_bio+0x73/0x140
    [] submit_bio_wait+0x5d/0x80
    [] blkdev_issue_flush+0x78/0xa0
    [] blkdev_fsync+0x3f/0x60
    [] vfs_fsync_range+0x1e/0x20
    [] vfs_fsync+0x1c/0x20
    [] do_fsync+0x41/0x80
    [] ? SyS_lseek+0x7e/0x80
    [] SyS_fsync+0x10/0x20
    [] system_call_fastpath+0x16/0x1b

    Fix this by moving the ->flush_rq allocation from blk_init_queue_node()
    to blk_init_allocated_queue(). blk_init_queue_node() also calls
    blk_init_allocated_queue() so this change is functionally equivalent
    for all blk_init_queue_node() callers.

    Reported-by: Hannes Reinecke
    Reported-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
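
Why moving the allocation helps both setup paths can be shown with a user-space sketch. The function names below are hypothetical stand-ins for the three kernel functions named in the message, and the flush request is modeled as an opaque buffer of arbitrary size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_RQ_SIZE 64 /* arbitrary size for the stand-in flush request */

struct sketch_queue {
	void *flush_rq;
};

static struct sketch_queue *sketch_alloc_queue(void)
{
	return calloc(1, sizeof(struct sketch_queue));
}

/* After the fix, the flush request is allocated here, the shared step. */
static struct sketch_queue *sketch_init_allocated_queue(struct sketch_queue *q)
{
	if (!q)
		return NULL;
	q->flush_rq = malloc(SKETCH_RQ_SIZE);
	return q->flush_rq ? q : NULL;
}

/* Convenience path: allocate, then run the shared init step. */
static struct sketch_queue *sketch_init_queue(void)
{
	struct sketch_queue *q = sketch_alloc_queue();

	if (!sketch_init_allocated_queue(q)) {
		free(q);
		return NULL;
	}
	return q;
}

/* Stands in for initializing the flush request when a flush is issued. */
static void sketch_issue_flush(struct sketch_queue *q)
{
	assert(q && q->flush_rq); /* the reported oops: flush_rq was NULL */
	memset(q->flush_rq, 0, SKETCH_RQ_SIZE);
}

static void sketch_free_queue(struct sketch_queue *q)
{
	if (q) {
		free(q->flush_rq);
		free(q);
	}
}
```

A caller that uses the two-step path, as request-based DM multipath does, now ends up with the same initialized queue as a caller of the one-step path.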
     

06 Mar, 2014

1 commit

  • trace_block_rq_complete does not take into account that a request can
    be partially completed, so we can get the following incorrect output
    from blkparse:

    C R 232 + 240 [0]
    C R 240 + 232 [0]
    C R 248 + 224 [0]
    C R 256 + 216 [0]

    but should be:

    C R 232 + 8 [0]
    C R 240 + 8 [0]
    C R 248 + 8 [0]
    C R 256 + 8 [0]

    Also, the summary statistics of completed requests and the final
    throughput in the output will be incorrect.

    This patch takes into account the real completion size of the request
    and fixes the wrong completion accounting.

    Signed-off-by: Roman Pen
    CC: Steven Rostedt
    CC: Frederic Weisbecker
    CC: Ingo Molnar
    CC: linux-kernel@vger.kernel.org
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Roman Pen
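
The corrected accounting can be sketched in user space: each completion event reports the sectors completed by that event rather than what remains of the request. The types and function name are hypothetical, and a 512-byte sector is assumed.

```c
#include <assert.h>

/* One completion line as blkparse would show it: "C R <sector> + <nr>". */
struct sketch_event {
	unsigned long long sector;
	unsigned int nr_sectors;
};

/* Hypothetical reduced request: current position and bytes still pending. */
struct sketch_rq {
	unsigned long long sector;
	unsigned int bytes_left;
};

static struct sketch_event sketch_trace_complete(struct sketch_rq *rq,
						 unsigned int nr_bytes)
{
	struct sketch_event ev;

	assert(nr_bytes <= rq->bytes_left && nr_bytes % 512 == 0);

	ev.sector = rq->sector;
	ev.nr_sectors = nr_bytes >> 9; /* completed size, not bytes_left >> 9 */

	/* Advance the request past the part that just completed. */
	rq->sector += ev.nr_sectors;
	rq->bytes_left -= nr_bytes;
	return ev;
}
```

Completing a 240-sector request at sector 232 in 4096-byte steps then yields 232 + 8, 240 + 8, and so on, matching the expected output listed in the commit message.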
     

20 Feb, 2014

1 commit


19 Feb, 2014

1 commit


11 Feb, 2014

1 commit

  • Switch to using a preallocated flush_rq for blk-mq similar to what's done
    with the old request path. This allows us to set up the request properly
    with a tag from the actually allowed range and ->rq_disk as needed by
    some drivers. To make life easier we also switch to dynamic allocation
    of ->flush_rq for the old path.

    This effectively reverts most of

    "blk-mq: fix for flush deadlock"

    and

    "blk-mq: Don't reserve a tag for flush request"

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Feb, 2014

1 commit


01 Jan, 2014

2 commits


24 Nov, 2013

2 commits

  • More prep work for immutable biovecs - with immutable bvecs drivers
    won't be able to use the biovec directly, they'll need to use helpers
    that take into account bio->bi_iter.bi_bvec_done.

    This updates callers for the new usage without changing the
    implementation yet.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Paul Clements
    Cc: Jim Paris
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Nagalakshmi Nandigama
    Cc: Sreekanth Reddy
    Cc: support@lsi.com
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: Alexander Viro
    Cc: Steven Whitehouse
    Cc: Herton Ronaldo Krzesinski
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Matthew Wilcox
    Cc: Keith Busch
    Cc: Stephen Hemminger
    Cc: Quoc-Son Anh
    Cc: Sebastian Ott
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Seth Jennings
    Cc: "Martin K. Petersen"
    Cc: Mike Snitzer
    Cc: Vivek Goyal
    Cc: "Darrick J. Wong"
    Cc: Chris Metcalf
    Cc: Jan Kara
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: drbd-user@lists.linbit.com
    Cc: nbd-general@lists.sourceforge.net
    Cc: cbe-oss-dev@lists.ozlabs.org
    Cc: xen-devel@lists.xensource.com
    Cc: virtualization@lists.linux-foundation.org
    Cc: linux-raid@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: DL-MPTFusionLinux@lsi.com
    Cc: linux-scsi@vger.kernel.org
    Cc: devel@driverdev.osuosl.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: cluster-devel@redhat.com
    Cc: linux-mm@kvack.org
    Acked-by: Geoff Levand

    Kent Overstreet
     
  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

09 Nov, 2013

3 commits

  • Signed-off-by: Jens Axboe

    Conflicts:
    block/blk-timeout.c

    Jens Axboe
     
  • This patch enables sysfs control of I/O request merging in the plug
    list. While this control has been implemented for the request queue,
    it was omitted for the plug list. Therefore, the block layer merges
    (or attempts to merge) requests even if merging was disabled using
    the sysfs nomerges parameter value 2.

    This limitation directly affects the functionality of the io_submit()
    system call. The system call enables a user to submit a batch of IO
    requests from user space using the struct iocb **ios input argument.
    However, the unconditional merging in the plug list potentially
    merges these requests together down the road. Therefore, there is no
    way to distinguish between an application sending a batch of
    sequential IOs and an application sending one big IO. Ultimately, all
    requests generated by the former app merge together within the plug
    list and look similar to those of the second app.

    While merging is a desirable feature that improves the performance of
    the IO subsystem for some applications, it is not useful at all for
    other applications like ours.

    Signed-off-by: Alireza Haghdoost
    Reviewed-by: Jeff Moyer

    Coding style modified.

    Signed-off-by: Jens Axboe

    Alireza Haghdoost
     
  • The soft lockup below happens at boot time on a system using dm
    multipath and udev rules to switch the scheduler.

    [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
    [ 356.127001] RIP: 0010:[] [] lock_timer_base.isra.35+0x1d/0x50
    ...
    [ 356.127001] Call Trace:
    [ 356.127001] [] try_to_del_timer_sync+0x20/0x70
    [ 356.127001] [] ? kmem_cache_alloc_node_trace+0x20a/0x230
    [ 356.127001] [] del_timer_sync+0x52/0x60
    [ 356.127001] [] cfq_exit_queue+0x32/0xf0
    [ 356.127001] [] elevator_exit+0x2f/0x50
    [ 356.127001] [] elevator_change+0xf1/0x1c0
    [ 356.127001] [] elv_iosched_store+0x20/0x50
    [ 356.127001] [] queue_attr_store+0x59/0xb0
    [ 356.127001] [] sysfs_write_file+0xc6/0x140
    [ 356.127001] [] vfs_write+0xbd/0x1e0
    [ 356.127001] [] SyS_write+0x49/0xa0
    [ 356.127001] [] system_call_fastpath+0x16/0x1b

    This is caused by a race between md device initialization by multipathd
    and the shell script that switches the scheduler using sysfs.

    - multipathd:
      SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
      -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
        q->elevator = elevator_alloc(q, e); // not yet initialized

    - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
      elevator_switch (in the call trace above)
        struct elevator_queue *old = q->elevator;
        q->elevator = elevator_alloc(q, new_e);
        elevator_exit(old); // lockup! (*)

    - multipathd: (cont.)
        err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified

    (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
    while timer->base == NULL. In this case, as the timer will never be
    initialized, this results in a lockup.

    This patch acquires q->sysfs_lock around elevator_init() in
    blk_init_allocated_queue(), to provide mutual exclusion between
    initialization of q->elevator and switching of the scheduler.

    This should fix this bugzilla:
    https://bugzilla.redhat.com/show_bug.cgi?id=902012

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Jens Axboe

    Tomoki Sekiyama
     

08 Nov, 2013

2 commits

  • If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
    to clean up structures allocated by the backing dev.

    ------------[ cut here ]------------
    WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
    ODEBUG: free active (active state 0) object type: percpu_counter hint: (null)
    Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
    CPU: 0 PID: 2739 Comm: lvchange Tainted: G W
    3.10.15-devel #14
    Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
    0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
    ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
    ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x6b/0xa0
    [] warn_slowpath_fmt+0x47/0x50
    [] ? debug_check_no_obj_freed+0xcf/0x250
    [] debug_print_object+0x85/0xa0
    [] debug_check_no_obj_freed+0x203/0x250
    [] kmem_cache_free+0x20c/0x3a0
    [] blk_alloc_queue_node+0x2a9/0x2c0
    [] blk_alloc_queue+0xe/0x10
    [] dm_create+0x1a3/0x530 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] dev_create+0x57/0x2b0 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
    [] ctl_ioctl+0x268/0x500 [dm_mod]
    [] ? get_lock_stats+0x22/0x70
    [] dm_ctl_ioctl+0xe/0x20 [dm_mod]
    [] do_vfs_ioctl+0x2ed/0x520
    [] ? fget_light+0x377/0x4e0
    [] SyS_ioctl+0x4b/0x90
    [] system_call_fastpath+0x1a/0x1f
    ---[ end trace 4b5ff0d55673d986 ]---
    ------------[ cut here ]------------

    This fix should be backported to stable kernels starting with 2.6.37. Note
    that in the kernels prior to 3.5 the affected code is different, but the
    bug is still there - bdi_init is called and bdi_destroy isn't.

    Signed-off-by: Mikulas Patocka
    Acked-by: Tejun Heo
    Cc: stable@kernel.org # 2.6.37+
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

    Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
    RIP: 0010:[] [] blk_requeue_request+0x94/0xa0
    RSP: 0018:ffff881057eefd60 EFLAGS: 00010012
    RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
    RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
    RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
    R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
    FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
    Stack:
    0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
    ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
    ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
    Call Trace:
    [] __scsi_queue_insert+0xa3/0x150
    [] ? scsi_eh_ready_devs+0x5e3/0x850
    [] scsi_queue_insert+0x13/0x20
    [] scsi_eh_flush_done_q+0x104/0x160
    [] scsi_error_handler+0x35b/0x660
    [] ? scsi_error_handler+0x0/0x660
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20
    Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
    RIP [] blk_requeue_request+0x94/0xa0
    RSP

    The RIP is this line:
    BUG_ON(blk_queued_rq(rq));

    After digging through the code, I think there may be a race between the
    request completion and the timer handler running.

    A timer is started for each request put on the device's queue (see
    blk_start_request->blk_add_timer). If the request does not complete
    before the timer expires, the timer handler (blk_rq_timed_out_timer)
    will mark the request complete atomically:

    static inline int blk_mark_rq_complete(struct request *rq)
    {
            return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
    }

    and then call blk_rq_timed_out. The latter function will call
    scsi_times_out, which will return one of BLK_EH_HANDLED,
    BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
    returned, blk_clear_rq_complete is called, and blk_add_timer is again
    called to simply wait longer for the request to complete.

    Now, if the request happens to complete while this is going on, what
    happens? Given that we know the completion handler will bail if it
    finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
    handler running after that bit is cleared. So, from the above
    paragraph, after the call to blk_clear_rq_complete. If the completion
    sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
    there (I haven't seen this in the cores). Next, if we get the
    completion before the call to list_add_tail, then the timer will
    eventually fire for an old req, which may either be freed or reallocated
    (there is evidence that this might be the case). Finally, if the
    completion comes in *after* the addition to the timeout list, I think
    it's harmless. The request will be removed from the timeout list,
    REQ_ATOM_COMPLETE will be set, and all will be well.

    This will only actually explain the coredumps *IF* the request
    structure was freed, reallocated *and* queued before the error handler
    thread had a chance to process it. That is possible, but it may make
    sense to keep digging for another race. I think that if this is what
    was happening, we would see other instances of this problem showing up
    as null pointer or garbage pointer dereferences, for example when the
    request structure was not re-used. It looks like we actually do run
    into that situation in other reports.

    This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
    &req->atomic_flags)); from blk_add_timer to the only caller that could
    trip over it (blk_start_request). It then inverts the calls to
    blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
    the race. I've boot tested this patch, but nothing more.
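
    The ordering argument above can be sketched in userspace C. This is
    only an illustration, not the kernel code: struct sketch_rq and all
    sketch_* helpers are hypothetical names, an atomic_bool stands in for
    REQ_ATOM_COMPLETE, and a plain flag stands in for membership in the
    timeout list. The sketch is single-threaded and does not model locking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct request; only what the race needs. */
struct sketch_rq {
    atomic_bool complete;    /* models the REQ_ATOM_COMPLETE bit */
    bool on_timeout_list;    /* models membership in the timeout list */
};

static void sketch_rq_init(struct sketch_rq *rq)
{
    atomic_init(&rq->complete, false);
    rq->on_timeout_list = false;
}

/* Returns the previous value of the bit, like test_and_set_bit(). */
static bool sketch_mark_complete(struct sketch_rq *rq)
{
    return atomic_exchange(&rq->complete, true);
}

static void sketch_clear_complete(struct sketch_rq *rq)
{
    atomic_store(&rq->complete, false);
}

static void sketch_add_timer(struct sketch_rq *rq)
{
    rq->on_timeout_list = true;
}

/* Completion path: bails out if the bit is already set by someone else. */
static bool sketch_complete(struct sketch_rq *rq)
{
    if (sketch_mark_complete(rq))
        return false;
    rq->on_timeout_list = false;    /* drop the pending timeout */
    return true;
}

/*
 * Timeout path for the "reset timer" outcome, in the order the patch
 * describes: re-arm the timer first, and only then clear the bit so the
 * completion path can never see a cleared bit with no timer queued.
 */
static void sketch_reset_timer(struct sketch_rq *rq)
{
    sketch_add_timer(rq);
    sketch_clear_complete(rq);
}
```

    With this order, a completion that arrives while the timeout handler
    still owns the request is rejected, and one that arrives after
    sketch_reset_timer() finds the request already back on the list.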

    Signed-off-by: Jeff Moyer
    Acked-by: Hannes Reinecke
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

30 Oct, 2013

1 commit


29 Oct, 2013

1 commit

  • The flush state machine takes in a struct request, which then is
    submitted multiple times to the underlying driver. The old block code
    reuses the same request for each of those, so it does not have an
    issue with tapping into the request pool. The new one on the other hand
    allocates a new request for each of the actual steps of the flush
    sequence. If we have already allocated all of the tags for IO, we will
    fail allocating the flush request.

    Set aside a reserved request just for flushes.
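
    The idea of a reserved request can be sketched as a tag pool split in
    two, assuming a simple linear scan. This is not the blk-mq allocator:
    the pool sizes, struct sketch_tags, and the sketch_* helpers are all
    hypothetical and chosen only to show why a flush can still get a tag
    when normal IO has exhausted its share.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_TAGS      4   /* total tags (hypothetical) */
#define SKETCH_NR_RESERVED  1   /* tags set aside, e.g. for flushes */

struct sketch_tags {
    bool in_use[SKETCH_NR_TAGS];
};

/*
 * Tags [0, SKETCH_NR_RESERVED) form the reserved pool; normal IO only
 * draws from the remaining tags. Returns a tag index, or -1 if the
 * selected pool is exhausted.
 */
static int sketch_get_tag(struct sketch_tags *t, bool reserved)
{
    int start = reserved ? 0 : SKETCH_NR_RESERVED;
    int end = reserved ? SKETCH_NR_RESERVED : SKETCH_NR_TAGS;

    for (int i = start; i < end; i++) {
        if (!t->in_use[i]) {
            t->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Returns a previously allocated tag to its pool. */
static void sketch_put_tag(struct sketch_tags *t, int tag)
{
    assert(tag >= 0 && tag < SKETCH_NR_TAGS && t->in_use[tag]);
    t->in_use[tag] = false;
}
```

    Even with every normal tag in use, sketch_get_tag(&t, true) can still
    succeed, which is the property the flush path needs.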

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into x number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.
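
    As a rough illustration of the 1:1 versus N:M case, the following
    sketch maps a CPU's software queue to a hardware queue index. The
    function name and the contiguous-group policy are assumptions made for
    this example; the mapping blk-mq actually uses may differ.

```c
#include <assert.h>

/*
 * Hypothetical mapping from a per-cpu software queue to a hardware
 * submission queue. With at least as many hardware queues as CPUs the
 * mapping is 1:1; otherwise CPUs are spread over the hardware queues in
 * contiguous groups.
 */
static unsigned int sketch_cpu_to_hwq(unsigned int cpu,
                                      unsigned int nr_cpus,
                                      unsigned int nr_hw)
{
    assert(nr_cpus > 0 && nr_hw > 0 && cpu < nr_cpus);

    if (nr_hw >= nr_cpus)
        return cpu;                 /* 1:1 mapping */

    return cpu * nr_hw / nr_cpus;   /* N:M mapping */
}
```

    For example, with 8 CPUs and 2 hardware queues, CPUs 0 to 3 share
    hardware queue 0 and CPUs 4 to 7 share hardware queue 1.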

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically the driver should be able to get a notification,
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.
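
    The per-request payload item can be pictured as one allocation that
    holds the request followed by the driver's private command. The sketch
    below is a userspace approximation under that assumption; the struct
    names, the cmd_size parameter, and the sketch_* helpers are
    hypothetical and do not reproduce the blk-mq interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request header. The alignment keeps the payload that
 * follows it suitably aligned for any driver structure. */
struct sketch_request {
    _Alignas(max_align_t) int tag;
};

/* Example of a driver-private command structure (hypothetical). */
struct sketch_driver_cmd {
    unsigned int opcode;
    unsigned long lba;
};

/* Allocates a zeroed request with cmd_size bytes of payload behind it.
 * The driver would declare cmd_size once, at init time. */
static struct sketch_request *sketch_alloc_request(size_t cmd_size, int tag)
{
    struct sketch_request *rq = calloc(1, sizeof(*rq) + cmd_size);

    if (rq)
        rq->tag = tag;
    return rq;
}

/* The payload lives directly after the request header. */
static void *sketch_rq_to_pdu(struct sketch_request *rq)
{
    return rq + 1;
}
```

    A driver would then call sketch_rq_to_pdu() on any request it is
    handed, instead of allocating and tracking its command separately.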

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe