15 Apr, 2015

1 commit

  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds
     

12 Apr, 2015

3 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • ... and don't skip access_ok() validation.

    Signed-off-by: Al Viro

    Al Viro
     
  • Jan Engelhardt reports a strange oops with an invalid ->sense_buffer
    pointer in scsi_init_cmd_errh() with the blk-mq code.

    The sense_buffer pointer should have been initialized by the call to
    scsi_init_request() from blk_mq_init_rq_map(), but there seems to be
    some non-repeatable memory corruptor.

    This patch makes sure we initialize the whole struct request allocation
    (and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by
    using __GFP_ZERO in the allocation. The old code initialized a couple
    of individual fields, leaving the rest undefined (although many of them
    are then initialized in later phases, like blk_mq_rq_ctx_init() etc.).

    It's not entirely clear why this matters, but it's the right thing to do
    regardless, and with 4.0 imminent this is the defensive "let's just make
    sure everything is initialized properly" patch.

    Tested-by: Jan Engelhardt
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Linus Torvalds
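
    As an illustration of the approach, a minimal sketch (the helper name is
    hypothetical, not the real blk-mq allocation path): adding __GFP_ZERO to
    the allocation mask zeroes the whole allocation up front, so nothing in
    the request or the driver payload that follows it is left undefined.

    /* Hypothetical helper, for illustration only.  kzalloc() is simply
     * kmalloc() with __GFP_ZERO, so every byte starts out zeroed instead
     * of relying on later per-field initialization. */
    static struct request *example_alloc_rq(unsigned int cmd_size, gfp_t gfp)
    {
            return kmalloc(sizeof(struct request) + cmd_size,
                           gfp | __GFP_ZERO);
    }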
     

31 Mar, 2015

1 commit

  • Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
    caused blk_stack_limits() to not properly stack queue_limits for stacked
    devices (e.g. DM).

    Fix this regression by establishing lcm_not_zero() and switching
    blk_stack_limits() over to using it.

    DM uses blk_set_stacking_limits() to establish the initial top-level
    queue_limits that are then built up based on underlying devices' limits
    using blk_stack_limits(). In the case of optimal_io_size (io_opt)
    blk_set_stacking_limits() establishes a default value of 0. With commit
    69c953c, lcm(0, n) is no longer n, which compromises proper stacking of
    the underlying devices' io_opt.

    Test:
    $ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
    $ cat /sys/block/sde/queue/optimal_io_size
    786432
    $ dmsetup create node --table "0 100 linear /dev/sde 0"

    Before this fix:
    $ cat /sys/block/dm-5/queue/optimal_io_size
    0

    After this fix:
    $ cat /sys/block/dm-5/queue/optimal_io_size
    786432

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # 3.19+
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
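
    A self-contained sketch of the lcm_not_zero() semantics the fix relies
    on (simplified, not the lib/lcm.c source verbatim; the test numbers
    mirror the io_opt example above):

    #include <stdio.h>

    static unsigned long gcd(unsigned long a, unsigned long b)
    {
            while (b) {
                    unsigned long t = a % b;
                    a = b;
                    b = t;
            }
            return a;
    }

    /* lcm(0, n) is 0, which wipes out a stacked limit ... */
    static unsigned long lcm(unsigned long a, unsigned long b)
    {
            return (a && b) ? a / gcd(a, b) * b : 0;
    }

    /* ... so limit stacking wants "lcm, but treat 0 as 'no limit set'" */
    static unsigned long lcm_not_zero(unsigned long a, unsigned long b)
    {
            unsigned long l = lcm(a, b);

            return l ? l : (a ? a : b);
    }

    int main(void)
    {
            /* DM's default io_opt is 0; the underlying device reports 786432 */
            printf("lcm(0, 786432)          = %lu\n", lcm(0UL, 786432UL));
            printf("lcm_not_zero(0, 786432) = %lu\n", lcm_not_zero(0UL, 786432UL));
            return 0;
    }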
     

20 Mar, 2015

1 commit

  • Use the right array index to reference the last
    element of rq->biotail->bi_io_vec[]

    Signed-off-by: Wenbo Wang
    Reviewed-by: Chong Yuan
    Fixes: 66cb45aa41315 ("block: add support for limiting gaps in SG lists")
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Wenbo Wang
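
    In other words (a sketch of the indexing, not the literal diff):
    bi_io_vec[] holds bi_vcnt valid entries, so the last one lives at
    index bi_vcnt - 1.

    /* the last populated bio_vec of the request's tail bio */
    struct bio_vec *bprev =
            &rq->biotail->bi_io_vec[rq->biotail->bi_vcnt - 1];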
     

19 Mar, 2015

1 commit

  • When allocating from the reserved tags pool, bt_get() is called with
    a NULL hctx. If all tags are in use, the hw queue is kicked to push
    out any pending IO, potentially freeing tags, and tag allocation is
    retried. The problem is that blk_mq_run_hw_queue() doesn't check for
    a NULL hctx. So we avoid it with a simple NULL hctx test.

    Tested by hammering mtip32xx with concurrent smartctl/hdparm.

    Signed-off-by: Sam Bradshaw
    Signed-off-by: Selvan Mani
    Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()")
    Cc: stable@kernel.org

    Added appropriate comment.

    Signed-off-by: Jens Axboe

    Sam Bradshaw
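
    A minimal sketch of the guard (the surrounding bt_get() retry loop is
    omitted):

    /* reserved-tag allocations reach this point with hctx == NULL,
     * so only kick the hardware queue when there is one to kick */
    if (hctx)
            blk_mq_run_hw_queue(hctx, false);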
     

21 Feb, 2015

1 commit

  • When reading blkio.throttle.io_serviced in a recently created blkio
    cgroup, it's possible to race against the creation of a throttle policy,
    which delays the allocation of stats_cpu.

    Like other functions in the throttle code, just checking for a NULL
    stats_cpu prevents the following oops caused by that race.

    [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
    [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
    [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
    [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
    [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
    [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
    [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
    [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300 Not tainted (3.19.0)
    [ 1137.734230] MSR: 9000000000009032 CR: 42008884 XER: 20000000
    [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
    GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
    GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
    GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
    GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
    GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
    GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
    [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
    [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
    [ 1137.734943] Call Trace:
    [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
    [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
    [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
    [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
    [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
    [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
    [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
    [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
    [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
    [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
    [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
    [ 1137.735383] Instruction dump:
    [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
    [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 e9090008 e9490010 e9290018

    Here is some code that easily reproduces the problem, although it was
    first found by running docker (the includes and macro values below are
    filled in here for completeness):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* values assumed for illustration; the original defines were not shown */
    #define CGPATH       "/sys/fs/cgroup/blkio"
    #define BUFFER_ALIGN 4096
    #define BUFFER_SIZE  4096
    #define NR_TESTS     1000

    void run(pid_t pid)
    {
            int n;
            int status;
            int fd;
            char *buffer;

            buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
            n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
            fd = open(CGPATH "/test/tasks", O_WRONLY);
            write(fd, buffer, n);
            close(fd);

            if (fork() > 0) {
                    fd = open("/dev/sda", O_RDONLY | O_DIRECT);
                    read(fd, buffer, 512);
                    close(fd);
                    wait(&status);
            } else {
                    fd = open(CGPATH "/test/blkio.throttle.io_serviced",
                              O_RDONLY);
                    n = read(fd, buffer, BUFFER_SIZE);
                    close(fd);
            }

            free(buffer);
            exit(0);
    }

    void test(void)
    {
            int status;

            mkdir(CGPATH "/test", 0666);
            if (fork() > 0)
                    wait(&status);
            else
                    run(getpid());
            rmdir(CGPATH "/test");
    }

    int main(int argc, char **argv)
    {
            int i;

            for (i = 0; i < NR_TESTS; i++)
                    test();
            return 0;
    }

    Reported-by: Ricardo Marin Matinata
    Signed-off-by: Thadeu Lima de Souza Cascardo
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Thadeu Lima de Souza Cascardo
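
    The guard itself is tiny; a sketch of its shape (the field name follows
    the description above, the exact spot in the throttle code is omitted):

    /* stats_cpu is allocated after the group becomes visible, so a
     * reader racing with throttle-policy creation can still see NULL */
    if (tg->stats_cpu == NULL)
            return 0;       /* report nothing instead of oopsing */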
     

13 Feb, 2015

3 commits

  • Pull block driver changes from Jens Axboe:
    "This contains:

    - The 4k/partition fixes for brd from Boaz/Matthew.

    - A few xen front/back block fixes from David Vrabel and Roger Pau
    Monne.

    - Floppy changes from Takashi, cleaning up the device file creation.

    - Switching libata to use the new blk-mq tagging policy, removing
    code (and a suboptimal implementation) from libata. This will
    throw you a merge conflict, since a bug in the original libata
    tagging code was fixed since this code was branched. Trivial.
    From Shaohua.

    - Conversion of loop to blk-mq, from Ming Lei.

    - Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
    He claims it improves on unreadable code, which will cost him a
    beer.

    - Maintainer update for NBD, which is now handled by Markus Pargmann.

    - NVMe:
    - Optimization from me that avoids a kmalloc/kfree per IO for
    smaller requests.
    ..."

    ...
    block: loop: don't handle REQ_FUA explicitly
    block: loop: introduce lo_discard() and lo_req_flush()
    block: loop: say goodby to bio
    block: loop: improve performance via blk-mq

    Linus Torvalds
     
  • Pull core block IO changes from Jens Axboe:
    "This contains:

    - A series from Christoph that cleans up and refactors various parts
    of the REQ_BLOCK_PC handling. Contributions in that series from
    Dongsu Park and Kent Overstreet as well.

    - CFQ:
    - A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
    - A stable patch fixing a potential crash in CFQ in OOM
    situations. From Konstantin Khlebnikov.

    - blk-mq:
    - Add support for tag allocation policies, from Shaohua. This is
    a prep patch enabling libata (and other SCSI parts) to use the
    blk-mq tagging, instead of rolling their own.
    - Various little tweaks from Keith and Mike, in preparation for
    DM blk-mq support.
    - Minor little fixes or tweaks from me.
    - A double free error fix from Tony Battersby.

    - The partition 4k issue fixes from Matthew and Boaz.

    - Add support for zero+unprovision for blkdev_issue_zeroout() from
    Martin"

    * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
    block: remove unused function blk_bio_map_sg
    block: handle the null_mapped flag correctly in blk_rq_map_user_iov
    blk-mq: fix double-free in error path
    block: prevent request-to-request merging with gaps if not allowed
    blk-mq: make blk_mq_run_queues() static
    dm: fix multipath regression due to initializing wrong request
    cfq-iosched: handle failure of cfq group allocation
    block: Quiesce zeroout wrapper
    block: rewrite and split __bio_copy_iov()
    block: merge __bio_map_user_iov into bio_map_user_iov
    block: merge __bio_map_kern into bio_map_kern
    block: pass iov_iter to the BLOCK_PC mapping functions
    block: add a helper to free bio bounce buffer pages
    block: use blk_rq_map_user_iov to implement blk_rq_map_user
    block: simplify bio_map_kern
    block: mark blk-mq devices as stackable
    block: keep established cmd_flags when cloning into a blk-mq request
    block: add blk-mq support to blk_insert_cloned_request()
    block: require blk_rq_prep_clone() be given an initialized clone request
    blk-mq: add tag allocation policy
    ...

    Linus Torvalds
     
  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the lifetime rules. The most important
    change in this part is splitting the unrelated nommu mmap flags out of
    it, but it also removes the backing_dev_info pointer from the
    address_space (and inode) and cleans up various other minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

10 Feb, 2015

2 commits

  • Pull EFI updates from Ingo Molnar:
    "Main changes:

    - Move efivarfs from the misc filesystem section to pseudo filesystem

    - Expose firmware platform size in sysfs

    - Improve robustness of get_memory_map() by removing assumptions on
    the size of efi_memory_desc_t.

    - various cleanups and fixes

    The biggest risk is the get_memory_map() change, which changes the way
    that both the arm64 and x86 EFI boot stub build the early memory map.
    There are no known regressions with it at the moment, but YMMV"

    * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    efi: Don't look for chosen@0 node on DT platforms
    firmware: efi: Remove unneeded guid unparse
    efi/libstub: Call get_memory_map() to obtain map and desc sizes
    efi: Small leak on error in runtime map code
    efi: rtc-efi: Mark UIE as unsupported
    arm64/efi: efistub: Apply __init annotation
    efi: Expose underlying UEFI firmware platform size to userland
    efi: Rename efi_guid_unparse to efi_guid_to_str
    efi: Update the URLs for efibootmgr
    fs: Make efivarfs a pseudo filesystem, built by default with EFI

    Linus Torvalds
     
  • cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC.
    In cfq_find_alloc_queue() the possible allocation failure is not
    handled. As a result the kernel oopses on a NULL pointer dereference
    when cfq_link_cfqq_cfqg() calls cfqg_get() on a NULL pointer.

    The bug was introduced in v3.5 by commit cd1604fab4f9 ("blkcg: factor
    out blkio_group creation"). Prior to that commit the cfq group lookup
    had returned a pointer to the root group as a fallback.

    This patch handles the error using the existing fallback oom_cfqq.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Tejun Heo
    Acked-by: Vivek Goyal
    Fixes: cd1604fab4f9 ("blkcg: factor out blkio_group creation")
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
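
    A sketch of the fallback (simplified; oom_cfqq is cfq's preallocated
    emergency queue, used when normal allocations fail):

    cfqg = cfq_lookup_create_cfqg(cfqd, blkcg);
    if (!cfqg) {
            /* GFP_ATOMIC allocation failed: fall back to the oom queue
             * instead of linking the new cfqq to a NULL group */
            cfqq = &cfqd->oom_cfqq;
            goto out;       /* 'out' is in the surrounding lookup, omitted */
    }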
     

06 Feb, 2015

8 commits

  • blkdev_issue_zeroout() printed a warning if a device failed a discard or
    write same request despite advertising support for these. That's fine
    for SCSI since we'll disable these commands if we get an error back from
    the disk saying that they are not supported. And consequently the
    warning only gets printed once.

    There are other types of block devices that support discard, however,
    and these may return -EOPNOTSUPP for each command but leave discard
    enabled in the queue limits. This will cause a warning message for every
    blkdev_issue_zeroout() invocation.

    Remove the offending warning messages.

    Reported-by: Sedat Dilek
    Signed-off-by: Martin K. Petersen
    Tested-by: Sedat Dilek
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Rewrite __bio_copy_iov using the copy_page_{from,to}_iter helpers, and
    split it into two simpler functions.

    This commit should contain only literal replacements, without
    functional changes.

    Cc: Kent Overstreet
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Dongsu Park
    [hch: removed the __bio_copy_iov wrapper]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Dongsu Park
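
    For reference, the "copy out" half has roughly this shape (a sketch of
    the split described above, error handling trimmed; the iterator is
    passed by value so the caller's position is untouched):

    static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
    {
            struct bio_vec *bvec;
            int i;

            /* hand each segment's page to the iov_iter helper; it walks
             * and advances the user iovecs for us */
            bio_for_each_segment_all(bvec, bio, i) {
                    size_t copied = copy_page_to_iter(bvec->bv_page,
                                                      bvec->bv_offset,
                                                      bvec->bv_len,
                                                      &iter);
                    if (copied != bvec->bv_len)
                            return -EFAULT;
            }
            return 0;
    }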
     
  • And also remove the unused bdev argument.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This saves a little code and allows the error handling to be simplified.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Make use of the new interface provided by iov_iter, backed by a
    scatter-gather list of iovecs, instead of the old interface based on
    sg_iovec. Also use iov_iter_advance() instead of manual iteration.

    This commit should contain only literal replacements, without
    functional changes.

    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Doug Gilbert
    Cc: "James E.J. Bottomley"
    Signed-off-by: Kent Overstreet
    [dpark: add more description in commit message]
    Signed-off-by: Dongsu Park
    [hch: fixed to do a deep clone of the iov_iter, and to properly use
    the iov_iter direction]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Kent Overstreet
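
    A tiny sketch of the iteration style this moves to (illustrative only):
    the consumer no longer steps through sg_iovec entries by hand, it just
    keeps advancing the single iov_iter cursor.

    while (iov_iter_count(&iter)) {
            size_t bytes = iov_iter_single_seg_count(&iter);

            /* ... map/pin the user pages covering this segment ... */

            iov_iter_advance(&iter, bytes);  /* the cursor moves for us */
    }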
     
  • The code snippet that walks all bio_vecs and frees their pages is
    open-coded in way too many places, so factor it into a helper. Also
    convert the slightly more complex cases in bio_kern_endio and
    __bio_copy_iov, where we break the freeing out of an existing loop
    into a separate one.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
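
    The helper being described boils down to something like this sketch:

    static void bio_free_pages(struct bio *bio)
    {
            struct bio_vec *bvec;
            int i;

            /* release every page attached to the bio's segments */
            bio_for_each_segment_all(bvec, bio, i)
                    __free_page(bvec->bv_page);
    }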
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Just open code the trivial mapping from a kernel virtual address to
    a bio instead of going through the complex user address mapping
    machinery.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
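
    The "trivial mapping" amounts to walking the buffer page by page and
    adding each page to the bio; a simplified sketch of that loop, given a
    queue q, a bio, and a kernel buffer data of len bytes (not the exact
    bio_map_kern()):

    unsigned long kaddr = (unsigned long)data;
    int i, nr_pages = DIV_ROUND_UP(offset_in_page(data) + len, PAGE_SIZE);

    for (i = 0; i < nr_pages; i++) {
            unsigned int offset = offset_in_page(kaddr);
            unsigned int bytes = min_t(unsigned int, len, PAGE_SIZE - offset);

            /* a kernel virtual address maps straight to its struct page */
            if (bio_add_pc_page(q, bio, virt_to_page(kaddr),
                                bytes, offset) < bytes)
                    break;          /* hit a queue limit */

            kaddr += bytes;
            len -= bytes;
    }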
     

05 Feb, 2015

1 commit

  • It took me a few tries to figure out what this code did; let's rewrite
    it into a more regular form.

    The thing that makes this one 'special' is the BSG_F_BLOCK flag: if
    that is not set we're not supposed/allowed to block and should
    spin-wait for completion.

    The (new) io_wait_event() will never see a false condition in the
    spinning case, and we will therefore not block.

    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Peter Zijlstra
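
    io_wait_event() is used like wait_event(), but accounts the sleep as IO
    wait; a heavily simplified usage sketch (bd->wq_done is the bsg wait
    queue, and the completion condition is paraphrased):

    /* in the !BSG_F_BLOCK (spinning) case the command was completed
     * synchronously beforehand, so the condition is already true on
     * entry and we never actually sleep here */
    io_wait_event(bd->wq_done, bd->done_cmds /* paraphrased condition */);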
     

30 Jan, 2015

3 commits

  • Pull EFI updates from Matt Fleming:

    " - Move efivarfs from the misc filesystem section to pseudo filesystem,
    since that's a more logical and accurate place - Leif Lindholm

    - Update efibootmgr URL in Kconfig help - Peter Jones

    - Improve accuracy of EFI guid function names - Borislav Petkov

    - Expose firmware platform size in sysfs for the benefit of EFI boot
    loader installers and other utilities - Steve McIntyre

    - Cleanup __init annotations for arm64/efi code - Ard Biesheuvel

    - Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel

    - Fix memory leak in error code path of runtime map code - Dan Carpenter

    - Improve robustness of get_memory_map() by removing assumptions on the
    size of efi_memory_desc_t (which could change in future spec
    versions) and querying the firmware instead of guessing about the
    memmap size - Ard Biesheuvel

    - Remove superfluous guid unparse calls - Ivan Khoronzhuk

    - Delete the unnecessary chosen@0 DT node FDT code, since it was
    duplicated from code in drivers/of - Leif Lindholm

    There's nothing super scary, mainly cleanups, and a merge from Ricardo who
    kindly picked up some patches from the linux-efi mailing list while I
    was out on annual leave in December.

    Perhaps the biggest risk is the get_memory_map() change from Ard, which
    changes the way that both the arm64 and x86 EFI boot stub build the
    early memory map. It would be good to have it bake in linux-next for a
    while.
    "

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The kobject memory inside blk-mq hctx/ctx shouldn't be freed before the
    kobject is released, because the driver core can access it freely until
    its release.

    We can't do the freeing in the ctx/hctx/mq_kobj release handlers either,
    because they can run before blk_cleanup_queue().

    Given that mq_kobj shouldn't have been introduced in the first place,
    this patch simply moves mq's release into blk_release_queue().

    Reported-by: Sasha Levin
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This reverts commit 76d697d10769048e5721510100bf3a9413a56385.

    The commit 76d697d10769048 causes general protection fault
    reported from Bart Van Assche:

    https://lkml.org/lkml/2015/1/28/334

    Reported-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 Jan, 2015

3 commits

  • blk_mq_alloc_request() may establish REQ_MQ_INFLIGHT in addition to
    incrementing the hctx->nr_active count. Any cmd_flags that are
    established in the newly allocated clone request must be preserved in
    addition to the cmd_flags that are later copied over from the original
    request as part of blk_rq_prep_clone().

    Otherwise, if REQ_MQ_INFLIGHT isn't set in the clone request the
    hctx->nr_active count won't get decremented via blk_mq_free_request().

    The only consumer of blk_rq_prep_clone() is request-based DM, which uses
    blk_rq_init() prior to calling blk_rq_prep_clone() for the non-blk-mq
    case. Given the cloned request's cmd_flags will be 0 it is safe to OR
    them with the original request's cmd_flags for both the non-blk-mq and
    blk-mq cases.

    Reported-by: Bart Van Assche
    Signed-off-by: Keith Busch
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Keith Busch
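
    The core of the change is to OR rather than overwrite when the clone's
    flags are set up; sketched below (the real code also masks which of the
    original's flags are carried over):

    /* keep whatever the allocator already set on the clone (e.g.
     * REQ_MQ_INFLIGHT for blk-mq) and add the original request's flags */
    clone->cmd_flags |= rq->cmd_flags;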
     
  • If the request passed to blk_insert_cloned_request() was allocated by
    a blk-mq device it must be submitted using blk_mq_insert_request().

    Signed-off-by: Keith Busch
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Keith Busch
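
    A sketch of the resulting dispatch decision in
    blk_insert_cloned_request() (simplified; accounting and error paths
    omitted, and the boolean arguments are spelled out in the comment):

    if (q->mq_ops) {
            /* a clone allocated from a blk-mq queue must go through the
             * blk-mq insert path, not the legacy elevator path;
             * arguments: at_head=false, run_queue=true, async=false */
            blk_mq_insert_request(rq, false, true, false);
            return 0;
    }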
     
  • Prepare to allow blk_rq_prep_clone() to accept clone requests that were
    allocated from blk-mq request queues. As such the blk_rq_prep_clone()
    caller must first initialize the clone request.

    Signed-off-by: Keith Busch
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Keith Busch
     

24 Jan, 2015

2 commits

  • This is the blk-mq part of tag allocation policy support. The default
    allocation policy isn't changed (though it's not a strict FIFO). The new
    policy is round-robin for libata, but it's a best-effort implementation:
    if multiple tasks are competing, the tags returned will be mixed (which
    is unavoidable even without blk-mq, as requests from different tasks can
    be mixed in the queue).

    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
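
    A toy userspace illustration of the two policies over a small tag
    bitmap (nothing blk-mq specific; it only shows why round-robin hands
    out tags in rotating order while the default just grabs the first
    free one):

    #include <stdio.h>

    #define NR_TAGS 8

    static unsigned char in_use[NR_TAGS];
    static int last_tag;                    /* round-robin cursor */

    static int get_tag(int round_robin)
    {
            int start = round_robin ? last_tag : 0;

            for (int i = 0; i < NR_TAGS; i++) {
                    int tag = (start + i) % NR_TAGS;

                    if (!in_use[tag]) {
                            in_use[tag] = 1;
                            if (round_robin)
                                    last_tag = (tag + 1) % NR_TAGS;
                            return tag;
                    }
            }
            return -1;                      /* no tag free */
    }

    int main(void)
    {
            for (int i = 0; i < 4; i++) {
                    int t = get_tag(1);     /* round-robin: 0, 1, 2, 3 */

                    printf("got tag %d\n", t);
                    in_use[t] = 0;          /* complete immediately */
            }
            return 0;
    }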
     
  • The libata tag allocation uses a round-robin policy. The next patch will
    make libata use the block layer's generic tag allocation, so let's add a
    policy to tag allocation.

    Currently there are two policies: FIFO (the default) and round-robin.

    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

22 Jan, 2015

3 commits

  • As Christoph put it:
    Can we just get rid of the warnings? It's fairly annoying as devices
    without partitions are perfectly fine and very useful.

    I, too, have seen this message on every VM boot for ages, on all my
    devices, and would love to just remove it. For me a partition table
    is only needed for a booting BIOS, grub, and the like.

    CC: Christoph Hellwig
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Boaz Harrosh
     
  • blkdev_issue_zeroout() will zero a given block range. This is done by
    way of explicit writing, thus provisioning or allocating the blocks on
    disk.

    There are use cases where the desired behavior is to zero the blocks but
    unprovision them if possible. The blocks must deterministically contain
    zeroes when they are subsequently read back.

    This patch adds a flag to blkdev_issue_zeroout() that provides this
    variant. If the discard flag is set and a block device guarantees
    discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
    the device does not support discard_zeroes_data or if the discard
    request fails we will fall back to first REQ_WRITE_SAME and then a
    regular REQ_WRITE.

    Also update the callers of blkdev_issue_zeroout() to reflect the new
    flag and make sb_issue_zeroout() prefer the discard approach.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
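
    The resulting strategy, sketched (close to, but not verbatim, the new
    blkdev_issue_zeroout() body; bdev, q, sector, nr_sects, gfp_mask and
    the discard flag come from its arguments):

    if (discard && blk_queue_discard(q) && q->limits.discard_zeroes_data &&
        blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0) == 0)
            return 0;       /* unprovisioned, yet guaranteed to read as zeroes */

    if (bdev_write_same(bdev) &&
        blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
                                ZERO_PAGE(0)) == 0)
            return 0;       /* one zeroed page, replicated by the device */

    /* last resort: plain zero-filled WRITE bios */
    return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);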
     
  • Hi,

    If you can manage to submit an async write as the first async I/O from
    the context of a process with realtime scheduling priority, then a
    cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It
    ends up in the best effort array, but actually has realtime I/O
    scheduling priority set in cfqq->ioprio.

    The reason is that cfq_get_queue assumes the default scheduling class and
    priority when there is no information present (i.e. when the async cfqq
    is created):

    static struct cfq_queue *
    cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
                  struct bio *bio, gfp_t gfp_mask)
    {
            const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
            const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);

    cic->ioprio starts out as 0, which is "invalid". So, class of 0
    (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:

    async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);

    static struct cfq_queue **
    cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
    {
            switch (ioprio_class) {
            case IOPRIO_CLASS_RT:
                    return &cfqd->async_cfqq[0][ioprio];
            case IOPRIO_CLASS_NONE:
                    ioprio = IOPRIO_NORM;
                    /* fall through */
            case IOPRIO_CLASS_BE:
                    return &cfqd->async_cfqq[1][ioprio];
            case IOPRIO_CLASS_IDLE:
                    return &cfqd->async_idle_cfqq;
            default:
                    BUG();
            }
    }

    Here, instead of returning a class mapped from the process' scheduling
    priority, we get back the bucket associated with IOPRIO_CLASS_BE.

    Now, there is no queue allocated there yet, so we create it:

    cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);

    That function ends up doing this:

    cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
    cfq_init_prio_data(cfqq, cic);

    cfq_init_cfqq() marks the priority as having changed. Then,
    cfq_init_prio_data() does this:

    ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
    switch (ioprio_class) {
    default:
            printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
    case IOPRIO_CLASS_NONE:
            /*
             * no prio set, inherit CPU scheduling settings
             */
            cfqq->ioprio = task_nice_ioprio(tsk);
            cfqq->ioprio_class = task_nice_ioclass(tsk);
            break;

    So we basically have two code paths that treat IOPRIO_CLASS_NONE
    differently, which results in an RT async cfqq filed into a best effort
    bucket.

    Attached is a patch which fixes the problem. I'm not sure how to make
    it cleaner. Suggestions would be welcome.

    Signed-off-by: Jeff Moyer
    Tested-by: Hidehiro Kawai
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jeff Moyer
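
    The fix, paraphrased as a sketch (not the literal diff): inside
    cfq_get_queue(), resolve IOPRIO_CLASS_NONE to the task's CPU scheduling
    class before picking the async bucket, mirroring what
    cfq_init_prio_data() already does.

    if (!is_sync) {
            if (!ioprio_valid(cic->ioprio)) {
                    struct task_struct *tsk = current;

                    /* same fallback cfq_init_prio_data() uses */
                    ioprio = task_nice_ioprio(tsk);
                    ioprio_class = task_nice_ioclass(tsk);
            }
            async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
            cfqq = *async_cfqq;
    }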
     

21 Jan, 2015

1 commit

  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to its original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows us to remove lots of backing_dev_info instances that
    aren't otherwise needed, and entirely gets rid of the concept of
    providing a backing_dev_info for a character device. It also removes
    the need for the mtd_inodefs filesystem.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig
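
    For a nommu driver, opting in ends up looking roughly like the sketch
    below (hedged: the hook name follows the commit, but the NOMMU_MAP_*
    capability constants are quoted from memory and the driver names are
    hypothetical):

    /* NOMMU_MAP_* names are assumed here, not verified against the tree */
    static unsigned exampledrv_mmap_capabilities(struct file *file)
    {
            /* advertise what this file supports to the nommu mmap code,
             * instead of hanging the flags off a backing_dev_info */
            return NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_WRITE;
    }

    static const struct file_operations exampledrv_fops = {
            .owner             = THIS_MODULE,
            .mmap_capabilities = exampledrv_mmap_capabilities,
    };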