02 Dec, 2014

1 commit

  • bio integrity handling is broken on a system with LVM layered atop a
    DIF/DIX SCSI drive because device mapper clones the bio, modifies the
    clone, and sends the clone to the lower layers for processing.
    However, the clone bio has bi_vcnt == 0, which means that when the sd
    driver calls bio_integrity_process to attach DIX data, the
    for_each_segment_all() call (which uses bi_vcnt) returns immediately
    and random garbage is sent to the disk on a disk write. The disk of
    course returns an error.

    Therefore, teach bio_integrity_process() to use bio_for_each_segment()
    to iterate the bio_vecs, since the per-bio iterator tracks which
    bio_vecs are associated with that particular bio. The integrity
    handling code is effectively part of the "driver" (it's not the bio
    owner), so it must use the correct iterator function.

    v2: Fix a compiler warning about abandoned local variables. This
    patch supersedes "block: bio_integrity_process uses wrong bio_vec
    iterator". Patch applies against 3.18-rc6.

    Signed-off-by: Darrick J. Wong
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Darrick J. Wong
     

12 Nov, 2014

1 commit


11 Nov, 2014

1 commit


05 Nov, 2014

1 commit

  • q->mq_usage_counter is a percpu_ref which is killed and drained when
    the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit()
    which involves freezing the queue is invoked on all existing queues.
    Because percpu_ref killing and draining involve a RCU grace period,
    doing the above on one queue after another may take a long time if
    there are many queues on the system.

    This patch splits out initiation of freezing and waiting for its
    completion, and updates blk_mq_queue_reinit_notify() so that the
    queues are frozen in parallel instead of one after another. Note that
    freezing and unfreezing are moved from blk_mq_queue_reinit() to
    blk_mq_queue_reinit_notify().

    Signed-off-by: Tejun Heo
    Reported-by: Christian Borntraeger
    Tested-by: Christian Borntraeger
    Signed-off-by: Jens Axboe

    Tejun Heo
     

31 Oct, 2014

1 commit

  • Priority of a merged request is computed by ioprio_best(). If one of the
    requests has undefined priority (IOPRIO_CLASS_NONE) and another request
    has priority from IOPRIO_CLASS_BE, the function will return the
    undefined priority which is wrong. Fix the function to properly return
    priority of a request with the defined priority.

    Fixes: d58cdfb89ce0c6bd5f81ae931a984ef298dbda20
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jan Kara
     

24 Oct, 2014

1 commit

  • while compiling integer err was showing as a set but unused variable.
    elevator_init_fn can be either cfq_init_queue or deadline_init_queue
    or noop_init_queue.
    all three of these functions are returning -ENOMEM if they fail to
    allocate the queue.
    so we should actually be returning the error code rather than
    returning 0 always.

    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Jens Axboe

    Sudip Mukherjee
     

23 Oct, 2014

1 commit

  • When sg_scsi_ioctl() fails to prepare request to submit in
    blk_rq_map_kern() we jump to a label where we just end up copying
    (luckily zeroed-out) kernel buffer to userspace instead of reporting
    error. Fix the problem by jumping to the right label.

    CC: Jens Axboe
    CC: linux-scsi@vger.kernel.org
    CC: stable@vger.kernel.org
    Coverity-id: 1226871
    Signed-off-by: Jan Kara

    Fixed up the, now unused, out label.

    Signed-off-by: Jens Axboe

    Jan Kara
     

22 Oct, 2014

1 commit

  • The problem is introduced by commit 764f612c6c3c231b(blk-merge:
    don't compute bi_phys_segments from bi_vcnt for cloned bio),
    and merge is needed if number of current segment isn't less than
    max segments.

    Strictly speaking, bio->bi_vcnt shouldn't be used here since
    it may not be accurate in cases of both cloned bio or bio cloned
    from, but bio_segments() is a bit expensive, and bi_vcnt is still
    the biggest number, so the approach should work.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per hardware context flush request, which both cleans up the
    code should scale better for flush intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
     

14 Oct, 2014

2 commits


13 Oct, 2014

2 commits

  • In __get_request calls to printk_ratelimited, include the function name so
    the callbacks suppressed message matches the messages that are printed,
    and add "dev" before the device name so it matches other block layer
    messages.

    Signed-off-by: Robert Elliott
    Reviewed-by: Webb Scales
    Signed-off-by: Jens Axboe

    Robert Elliott
     
  • In blk_update_request, change the printk_ratelimited
    prefix from end_request to blk_update_request so it
    matches the name printed if rate limiting occurs.

    Old:
    [10234.933106] blk_update_request: 174 callbacks suppressed
    [10234.934940] end_request: critical target error, dev sdr, sector 16
    [10234.949788] end_request: critical target error, dev sdr, sector 16

    New:
    [16863.445173] blk_update_request: 398 callbacks suppressed
    [16863.447029] blk_update_request: critical target error, dev sdr, sector
    1442066176
    [16863.449383] blk_update_request: critical target error, dev sdr, sector
    802802888
    [16863.451680] blk_update_request: critical target error, dev sdr, sector
    1609535456

    Signed-off-by: Robert Elliott
    Reviewed-by: Webb Scales
    Signed-off-by: Jens Axboe

    Robert Elliott
     

10 Oct, 2014

2 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of activities on percpu front. Notable changes are...

    - percpu allocator now can take @gfp. If @gfp doesn't contain
    GFP_KERNEL, it tries to allocate from what's already available to
    the allocator and a work item tries to keep the reserve around
    certain level so that these atomic allocations usually succeed.

    This will replace the ad-hoc percpu memory pool used by
    blk-throttle and also be used by the planned blkcg support for
    writeback IOs.

    Please note that I noticed a bug in how @gfp is interpreted while
    preparing this pull request and applied the fix 6ae833c7fe0c
    ("percpu: fix how @gfp is interpreted by the percpu allocator")
    just now.

    - percpu_ref now uses longs for percpu and global counters instead of
    ints. It leads to more sparse packing of the percpu counters on
    64bit machines but the overhead should be negligible and this
    allows using percpu_ref for refcnting pages and in-memory objects
    directly.

    - The switching between percpu and single counter modes of a
    percpu_ref is made independent of putting the base ref and a
    percpu_ref can now optionally be initialized in single or killed
    mode. This allows avoiding percpu shutdown latency for cases where
    the refcounted objects may be synchronously created and destroyed
    in rapid succession with only a fraction of them reaching fully
    operational status (SCSI probing does this when combined with
    blk-mq support). It's also planned to be used to implement forced
    single mode to detect underflow more timely for debugging.

    There's a separate branch percpu/for-3.18-consistent-ops which cleans
    up the duplicate percpu accessors. That branch causes a number of
    conflicts with s390 and other trees. I'll send a separate pull
    request w/ resolutions once other branches are merged"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
    percpu: fix how @gfp is interpreted by the percpu allocator
    blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
    percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
    percpu_ref: add PERCPU_REF_INIT_* flags
    percpu_ref: decouple switching to percpu mode and reinit
    percpu_ref: decouple switching to atomic mode and killing
    percpu_ref: add PCPU_REF_DEAD
    percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
    percpu_ref: replace pcpu_ prefix with percpu_
    percpu_ref: minor code and comment updates
    percpu_ref: relocate percpu_ref_reinit()
    Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
    Revert "percpu: free percpu allocation info for uniprocessor system"
    percpu-refcount: make percpu_ref based on longs instead of ints
    percpu-refcount: improve WARN messages
    percpu: fix locking regression in the failure path of pcpu_alloc()
    percpu-refcount: add @gfp to percpu_ref_init()
    proportions: add @gfp to init functions
    percpu_counter: add @gfp to percpu_counter_init()
    percpu_counter: make percpu_counters_lock irq-safe
    ...

    Linus Torvalds
     
  • It isn't correct to figure out req->bi_phys_segments from bio->bi_vcnt
    if the bio is cloned.

    Signed-off-by: Ming Lei
    Tested-by: Jeff Mahoney
    Signed-off-by: Jens Axboe

    Ming Lei
     

09 Oct, 2014

1 commit

  • The math in both blk_stack_limits() and queue_limit_alignment_offset()
    assume that a block device's io_min (aka minimum_io_size) is always a
    power-of-2. Fix the math such that it works for non-power-of-2 io_min.

    This issue (of alignment_offset != 0) became apparent when testing
    dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
    1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
    block size") unlocked the potential for alignment_offset != 0 due to
    the dm-thin-pool's io_min possibly being a non-power-of-2.

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

08 Oct, 2014

1 commit

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     

07 Oct, 2014

2 commits


04 Oct, 2014

2 commits

  • Users of bio_clone_fast() do not want bios with their own bvecs.
    Allocating a bvec mempool as part of the bioset intended for such users
    is a waste of memory.

    bioset_create_nobvec() creates a bioset that doesn't have the bvec
    mempool.

    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     
  • Request cloning clones bios in the request to track the completion
    of each bio.
    For that purpose, we can use bio_clone_fast() instead of bio_clone()
    to avoid unnecessary allocation and copy of bvecs.

    This patch reduces memory footprint of request-based device-mapper
    (about 1-4KB for each request) and is a preparation for further
    reduction of memory usage by removing unused bvec mempool.

    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

01 Oct, 2014

1 commit


28 Sep, 2014

1 commit

  • The kernel used to contain two functions for length-delimited,
    case-insensitive string comparison, strnicmp with correct semantics
    and a slightly buggy strncasecmp. The latter is the POSIX name, so
    strnicmp was renamed to strncasecmp, and strnicmp made into a wrapper
    for the new strncasecmp to avoid breaking existing users.

    To allow the compat wrapper strnicmp to be removed at some point in
    the future, and to avoid the extra indirection cost, do
    s/strnicmp/strncasecmp/g.

    Cc: Jens Axboe
    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Jens Axboe

    Rasmus Villemoes
     

27 Sep, 2014

13 commits

  • The T10 Protection Information format is also used by some devices that
    do not go through the SCSI layer (virtual block devices, NVMe). Relocate
    the relevant functions to a block layer library that can be used without
    involving SCSI.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • We'd occasionally merge requests with conflicting integrity flags.
    Introduce a merge helper which checks that the requests have compatible
    integrity payloads.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Make the choice of checksum a per-I/O property by introducing a flag
    that can be inspected by the SCSI layer. There are several reasons for
    this:

    1. It allows us to switch choice of checksum without unloading and
    reloading the HBA driver.

    2. During error recovery we need to be able to tell the HBA that
    checksums read from disk should not be verified and converted to IP
    checksums.

    3. For error injection purposes we need to be able to write a bad guard
    tag to storage. Since the storage device only supports T10 CRC we
    need to be able to disable IP checksum conversion on the HBA.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Move flags affecting the integrity code out of the bio bi_flags and into
    the block integrity payload.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • So far we have relied on the app tag size to determine whether a disk
    has been formatted with T10 protection information or not. However, not
    all target devices provide application tag storage.

    Add a flag to the block integrity profile that indicates whether the
    disk has been formatted with protection information.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Add a BLK_ prefix to the integrity profile flags. Also rename the flags
    to be more consistent with the generate/verify terminology in the rest
    of the integrity code.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Instead of the "operate" parameter we pass in a seed value and a pointer
    to a function that can be used to process the integrity metadata. The
    generation function is changed to have a return value to fit into this
    scheme.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Now that the protection interval has been detached from the sector size
    we need to be able to handle sizes that are different from 4K and
    512. Make the interval calculation generic.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The protection interval is not necessarily tied to the logical block
    size of a block device. Stop using the terms "sector" and "sectors".

    Going forward we will use the term "seed" to describe the initial
    reference tag value for a given I/O. "Interval" will be used to describe
    the portion of the data buffer that a given piece of protection
    information is associated with.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • bip_buf is not really needed so we can remove it.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • None of the filesystems appear interested in using the integrity tagging
    feature. Potentially because very few storage devices actually permit
    using the application tag space.

    Remove the tagging functions.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • For commands like REQ_COPY we need a way to pass extra information along
    with each bio. Like integrity metadata this information must be
    available at the bottom of the stack so bi_private does not suffice.

    Rename the existing bi_integrity field to bi_special and make it a union
    so we can have different bio extensions for each class of command.

    We previously used bi_integrity != NULL as a way to identify whether a
    bio had integrity metadata or not. Introduce a REQ_INTEGRITY to be the
    indicator now that bi_special can contain different things.

    In addition, bio_integrity(bio) will now return a pointer to the
    integrity payload (when applicable).

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • bdev_integrity_enabled() is only used by bio_integrity_enabled().
    Combine these two functions.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

26 Sep, 2014

4 commits

  • This patch supports to run one single flush machinery for
    each blk-mq dispatch queue, so that:

    - current init_request and exit_request callbacks can
    cover flush request too, then the buggy copying way of
    initializing flush request's pdu can be fixed

    - flushing performance gets improved in case of multi hw-queue

    In fio sync write test over virtio-blk(4 hw queues, ioengine=sync,
    iodepth=64, numjobs=4, bs=4K), it is observed that througput gets
    increased a lot over my test environment:
    - throughput: +70% in case of virtio-blk over null_blk
    - throughput: +30% in case of virtio-blk over SSD image

    The multi virtqueue feature isn't merged to QEMU yet, and patches for
    the feature can be found in below tree:

    git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4

    And simply passing 'num_queues=4 vectors=5' should be enough to
    enable multi queue(quad queue) feature for QEMU virtio-blk.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(),
    so that this function can find the corresponding blk_flush_queue
    bound with current mq context since the flush queue will become
    per hw-queue.

    For legacy queue, the parameter can be simply 'NULL'.

    For multiqueue case, the parameter should be set as the context
    from which the related request is originated. With this context
    info, the hw queue and related flush queue can be found easily.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Just figuring out flush queue at the entry of kicking off flush
    machinery and request's completion handler, then pass it through.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Now mission of the two helpers is over, and just call
    blk_alloc_flush_queue() and blk_free_flush_queue() directly.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei