25 Aug, 2017

1 commit

  • The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED
    and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add them to
    blk-mq-debugfs.c so that they appear as names in debugfs instead of as
    numbers.
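
    blk-mq-debugfs.c exposes these flags through bit-indexed name tables;
    a minimal sketch of what the missing entries look like (abridged,
    assuming the file's existing QUEUE_FLAG_NAME and CMD_FLAG_NAME helper
    macros):

    #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
    static const char *const blk_queue_flag_name[] = {
            /* ... existing entries ... */
            QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
            QUEUE_FLAG_NAME(QUIESCED),
    };

    #define CMD_FLAG_NAME(name) [__REQ_##name] = #name
    static const char *const cmd_flag_name[] = {
            /* ... existing entries ... */
            CMD_FLAG_NAME(NOWAIT),
    };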

    Reviewed-by: Omar Sandoval
    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche

24 Aug, 2017

2 commits

  • Since we split the scsi_request out of struct request, bsg fails to
    provide a reply buffer for the drivers. This used to be done via the
    pointer for sense data, which is no longer preallocated.

    Failing to allocate/assign it results in illegal dereferences, because
    LLDs use this pointer without checking it.

    An example panic on s390x, using the zFCP driver, looks like this (I had
    debugging on, otherwise NULL-pointer dereferences wouldn't even panic on
    s390x):

    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6403
    Fault in home space mode while using kernel ASCE.
    AS:0000000001590007 R3:0000000000000024
    Oops: 0038 ilc:2 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.12.0-bsg-regression+ #3
    Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
    task: 0000000065cb0100 task.stack: 0000000065cb4000
    Krnl PSW : 0704e00180000000 000003ff801e4156 (zfcp_fc_ct_els_job_handler+0x16/0x58 [zfcp])
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 0000000000000001 000000005fa9d0d0 000000005fa9d078 0000000000e16866
    000003ff00000290 6b6b6b6b6b6b6b6b 0000000059f78f00 000000000000000f
    00000000593a0958 00000000593a0958 0000000060d88800 000000005ddd4c38
    0000000058b50100 07000000659cba08 000003ff801e8556 00000000659cb9a8
    Krnl Code: 000003ff801e4146: e31020500004 lg %r1,80(%r2)
    000003ff801e414c: 58402040 l %r4,64(%r2)
    #000003ff801e4150: e35020200004 lg %r5,32(%r2)
    >000003ff801e4156: 50405004 st %r4,4(%r5)
    000003ff801e415a: e54c50080000 mvhi 8(%r5),0
    000003ff801e4160: e33010280012 lt %r3,40(%r1)
    000003ff801e4166: a718fffb lhi %r1,-5
    000003ff801e416a: 1803 lr %r0,%r3
    Call Trace:
    ([] zfcp_fsf_req_complete+0x726/0x768 [zfcp])
    [] zfcp_fsf_reqid_check+0x102/0x180 [zfcp]
    [] zfcp_qdio_int_resp+0x230/0x278 [zfcp]
    [] qdio_kick_handler+0x2ae/0x2c8
    [] __tiqdio_inbound_processing+0x406/0xc10
    [] tasklet_action+0x15a/0x1d8
    [] __do_softirq+0x3ec/0x848
    [] irq_exit+0x74/0xf8
    [] do_IRQ+0xba/0xf0
    [] io_int_handler+0x104/0x2d4
    [] enabled_wait+0xb6/0x188
    ([] enabled_wait+0x9e/0x188)
    [] arch_cpu_idle+0x32/0x50
    [] default_idle_call+0x52/0x68
    [] do_idle+0x102/0x188
    [] cpu_startup_entry+0x3e/0x48
    [] smp_start_secondary+0x11c/0x130
    [] restart_int_handler+0x62/0x78
    [] (null)
    INFO: lockdep is turned off.
    Last Breaking-Event-Address:
    [] zfcp_fc_ct_job_handler+0x3e/0x48 [zfcp]

    Kernel panic - not syncing: Fatal exception in interrupt

    This patch moves bsg-lib to allocate and set up struct bsg_job ahead of
    time, including the allocation of a buffer for the reply data.

    This means struct bsg_job is no longer allocated separately, but as part
    of the struct request allocation, similar to struct scsi_cmd. Reflect
    this in the names of the functions that used to handle creation and
    destruction of struct bsg_job.
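
    A rough sketch of the resulting setup path (function and field names
    are illustrative, not the verbatim patch):

    /* Sketch: set up the job, including the reply buffer that LLDs will
     * dereference, when the request itself is initialized. */
    static int bsg_init_rq(struct request_queue *q, struct request *req,
                           gfp_t gfp)
    {
            struct bsg_job *job = blk_mq_rq_to_pdu(req);

            memset(job, 0, sizeof(*job));
            job->reply = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
            return job->reply ? 0 : -ENOMEM;
    }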

    Reported-by: Steffen Maier
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Benjamin Block
    Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request")
    Cc: #4.11+
    Signed-off-by: Jens Axboe

    Benjamin Block
  • A discard request is usually very big and can easily use the entire
    bandwidth budget of a cgroup. A discard request's size doesn't really
    reflect the amount of data written, so it doesn't make sense to account
    it against the bandwidth budget. Jens pointed out that treating the
    size as 0 doesn't make sense either, because a discard request does
    have a cost. But it's not easy to determine the actual cost, so this
    patch simply accounts each discard request as one sector.
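
    In blk-throttle terms, the accounted size of a bio then becomes (a
    short sketch consistent with the description above; one sector = 512
    bytes):

    /* Account a discard as one 512-byte sector instead of its
     * (potentially huge) nominal size. */
    static inline unsigned int throtl_bio_data_size(struct bio *bio)
    {
            if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
                    return 512;
            return bio->bi_iter.bi_size;
    }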

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li

15 Aug, 2017

1 commit

  • blk_mq_get_request() does not release the caller's queue usage counter
    when allocation fails, so the caller still has to account for its own
    queue usage when it is unable to allocate a request.
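
    The callers therefore pair their blk_queue_enter() with an explicit
    blk_queue_exit() on the failure path; roughly (an abridged sketch):

            rq = blk_mq_get_request(q, NULL, op, &alloc_data);
            if (!rq) {
                    blk_queue_exit(q);      /* drop our queue usage counter */
                    return ERR_PTR(-EWOULDBLOCK);
            }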

    Fixes: 1ad43c0078b7 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")

    Reported-by: Max Gurtovoy
    Reviewed-by: Ming Lei
    Reviewed-by: Sagi Grimberg
    Tested-by: Max Gurtovoy
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch

10 Aug, 2017

3 commits

  • The blk_mq_delay_kick_requeue_list() function is used by the device
    mapper, and only by the device mapper, to rerun the queue and requeue
    list after a delay. It is currently called once per requeued request.
    Modify it so that the queue is run once per path-change event instead
    of once per requeued request.
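
    One way to achieve that is to let every call re-arm a single delayed
    work item, so repeated calls within the delay window coalesce into one
    queue run; a sketch of the coalescing idea:

    void blk_mq_delay_kick_requeue_list(struct request_queue *q,
                                        unsigned long msecs)
    {
            /* mod_delayed_work() semantics: a later call simply moves the
             * timer, so a burst of requeues yields a single queue run. */
            kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work,
                                        msecs_to_jiffies(msecs));
    }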

    Fixes: commit 2849450ad39d ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
    Signed-off-by: Bart Van Assche
    Cc: Mike Snitzer
    Cc: Laurence Oberman
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
  • This gets us back to the behavior in 4.12 and earlier.

    Signed-off-by: Christoph Hellwig
    Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • In the dm-integrity target we register an integrity profile that has
    both the generate_fn and verify_fn callbacks set to NULL.

    This is used if dm-integrity is stacked under a dm-crypt device for
    authenticated encryption (the integrity payload contains the
    authentication tag and IV seed).

    In this case the verification is done through dm-crypt's own crypto
    API processing; the integrity profile is only the holder of this data.
    (And the memory is owned by dm-crypt as well.)

    After commit 7c20f11680a441df09de7235206f70115fbf6290 ("bio-integrity:
    stop abusing bi_end_io") and previous changes, we get this crash:

    : BUG: unable to handle kernel NULL pointer dereference at (null)
    : IP: (null)
    : *pde = 00000000
    ...
    :
    : Workqueue: kintegrityd bio_integrity_verify_fn
    : task: f48ae180 task.stack: f4b5c000
    : EIP: (null)
    : EFLAGS: 00210286 CPU: 0
    : EAX: f4b5debc EBX: 00001000 ECX: 00000001 EDX: 00000000
    : ESI: 00001000 EDI: ed25f000 EBP: f4b5dee8 ESP: f4b5dea4
    : DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    : CR0: 80050033 CR2: 00000000 CR3: 32823000 CR4: 001406d0
    : Call Trace:
    : ? bio_integrity_process+0xe3/0x1e0
    : bio_integrity_verify_fn+0xea/0x150
    : process_one_work+0x1c7/0x5c0
    : worker_thread+0x39/0x380
    : kthread+0xd6/0x110
    : ? process_one_work+0x5c0/0x5c0
    : ? kthread_worker_fn+0x100/0x100
    : ? kthread_worker_fn+0x100/0x100
    : ret_from_fork+0x19/0x24
    : Code: Bad EIP value.
    : EIP: (null) SS:ESP: 0068:f4b5dea4
    : CR2: 0000000000000000

    This patch just skips the whole verify workqueue if verify_fn is set
    to NULL.
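
    In the integrity completion path, the check amounts to only deferring
    to the verify workqueue when the profile can actually verify;
    sketched (abridged):

            /* Nothing to verify when the profile has no verify_fn (e.g.
             * dm-integrity under dm-crypt): complete the bio directly. */
            if (bio_op(bio) == REQ_OP_READ && !bio->bi_status &&
                bi->profile->verify_fn) {
                    INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
                    queue_work(kintegrityd_wq, &bip->bip_work);
                    return false;
            }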

    Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
    Signed-off-by: Milan Broz
    [hch: trivial whitespace fix]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Milan Broz

30 Jul, 2017

2 commits

  • Groups of BFQ queues are represented by generic entities in BFQ. When
    a queue belonging to a parent entity is deactivated, the parent entity
    may need to be deactivated too, in case the deactivated queue was the
    only active queue for the parent entity. This deactivation may need to
    be propagated upwards if the entity belongs, in its turn, to a further
    higher-level entity, and so on. In particular, the upward propagation
    of deactivation stops at the first parent entity that remains active
    even if one of its child entities has been deactivated.

    To decide whether this non-deactivation condition holds for a parent
    entity, BFQ checks whether the field next_in_service is still not NULL
    for the parent entity after the deactivation of one of its child
    entities. If it is not NULL, then there are certainly other active
    entities in the parent entity, and the deactivations can stop.

    Unfortunately, this check misses a corner case: if in_service_entity
    is not NULL, then next_in_service may happen to be NULL, although the
    parent entity is evidently active. This happens if: 1) the entity
    pointed by in_service_entity is the only active entity in the parent
    entity, and 2) according to the definition of next_in_service, the
    in_service_entity cannot be considered as next_in_service. See the
    comments on the definition of next_in_service for details on this
    second point.

    Hitting the above corner case causes crashes.

    To address this issue, this commit:
    1) Extends the above check, on next_in_service only, to check both
    next_in_service and in_service_entity (if either of them is not NULL,
    then no further deactivation is performed), as sketched below.
    2) Improves the (important) comments on how next_in_service is defined
    and updated; in particular it fixes a few rather obscure paragraphs.
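
    In condensed form, the strengthened stop condition looks like this
    (illustrative, using bfq's sched_data fields):

            /* The parent entity is certainly still active if it has
             * either a next-in-service or an in-service child entity:
             * stop propagating the deactivation upwards. */
            if (sd->next_in_service || sd->in_service_entity)
                    return true;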

    Reported-by: Eric Wheeler
    Reported-by: Rick Yiu
    Reported-by: Tom X Nguyen
    Signed-off-by: Paolo Valente
    Tested-by: Eric Wheeler
    Tested-by: Rick Yiu
    Tested-by: Laurentiu Nicola
    Tested-by: Tom X Nguyen
    Signed-off-by: Jens Axboe

    Paolo Valente
  • BFQ implements hierarchical scheduling by representing each group of
    queues with a generic parent entity. For each parent entity, BFQ
    maintains an in_service_entity pointer: if one of the child entities
    happens to be in service, in_service_entity points to it. The
    resetting of these pointers happens only on queue expirations: when
    the in-service queue is expired, i.e., stops being the queue in
    service, BFQ resets all in_service_entity pointers along the
    parent-entity path from this queue to the root entity.

    Functions handling the scheduling of entities assume, naturally, that
    in-service entities are active, i.e., have pending I/O requests (or,
    as a special case, even if they have no pending requests, they are
    expected to receive a new request very soon, with the scheduler idling
    the storage device while waiting for such an event). Unfortunately,
    the above resetting scheme of the in_service_entity pointers may cause
    this assumption to be violated. For example, the in-service queue may
    happen to remain without requests because of a request merge. In this
    case the queue does become idle, and all related data structures are
    updated accordingly. But in_service_entity still points to the queue
    in the parent entity. This inconsistency may even propagate to
    higher-level parent entities, if they happen to become idle as well,
    as a consequence of the leaf queue becoming idle. For this queue and
    parent entities, scheduling functions have an undefined behaviour,
    and, as reported, may easily lead to kernel crashes or hangs.

    This commit addresses this issue by simply resetting the
    in_service_entity field also when it is detected to point to an entity
    becoming idle (regardless of why the entity becomes idle).
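
    The fix thus reduces to clearing the pointer wherever an entity leaves
    the set of active entities because it has become idle; in condensed
    form (illustrative):

            /* entity has become idle: it must no longer be seen as in
             * service by its parent's scheduler data. */
            if (sd->in_service_entity == entity)
                    sd->in_service_entity = NULL;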

    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Laurentiu Nicola
    Signed-off-by: Jens Axboe

    Paolo Valente

25 Jul, 2017

1 commit

  • We already do this for PCI mappings, and the higher-level code now
    expects that CPU on/offlining doesn't have an effect on the queue
    mappings.

    Signed-off-by: Christoph Hellwig
    Tested-by: Max Gurtovoy
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig

24 Jul, 2017

1 commit

  • The blk-mq code lacks support for looking at the rpm_status field,
    tracking active requests, and the RQF_PM flag.

    With the default switch to blk-mq for SCSI, people are starting to run
    into suspend/resume issues because of this, so make sure we disable
    the runtime PM functionality until it is properly implemented.
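
    The simplest way to disable it is to turn runtime-PM initialization
    into a no-op for blk-mq queues; a sketch consistent with the
    description:

    void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
    {
            /* blk-mq does not track rpm_status, active requests or
             * RQF_PM yet: leave runtime PM disabled for such queues. */
            if (q->mq_ops)
                    return;

            q->dev = dev;
            q->rpm_status = RPM_ACTIVE;
            pm_runtime_set_autosuspend_delay(q->dev, -1);
            pm_runtime_use_autosuspend(q->dev);
    }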

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig

12 Jul, 2017

3 commits

  • There are mq devices (e.g., virtio-blk, nbd and loop) which don't
    invoke blk_mq_run_hw_queues() after the completion of a request. If
    bfq is enabled on these devices and the slice_idle attribute or
    strict_guarantees attribute is set to zero, it is possible that after
    a request completion the remaining requests of a busy bfq queue will
    stall in the bfq scheduler until a new request arrives.

    To fix this scheduler latency problem, we need to check whether or not
    all issued requests have completed, and dispatch more requests to the
    driver if none are in flight.
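
    In the completion path this amounts to something like the following
    (an illustrative sketch, not the verbatim patch):

            /* After a completion: if the driver has run dry but bfq still
             * holds queued requests, kick the hardware queues ourselves,
             * since some drivers won't call blk_mq_run_hw_queues(). */
            bfqd->rq_in_driver--;
            if (bfqd->rq_in_driver == 0 && bfqd->queued > 0)
                    blk_mq_run_hw_queues(bfqd->queue, true);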

    The problem can be reproduced by running the following script
    on a virtio-blk device with nr_hw_queues set to 1:

    #!/bin/sh

    dev=vdb
    # mount point for dev
    mp=/tmp/mnt
    cd $mp

    job=strict.job
    cat <<EOF > $job
    [global]
    direct=1
    bs=4k
    size=256M
    rw=write
    ioengine=libaio
    iodepth=128
    runtime=5
    time_based

    [1]
    filename=1.data

    [2]
    new_group
    filename=2.data
    EOF

    echo bfq > /sys/block/$dev/queue/scheduler
    echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
    fio $job

    Signed-off-by: Hou Tao
    Reviewed-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Hou Tao
  • The start time of an eligible entity should be less than or equal to
    the current virtual time, and an entity in the idle tree has a finish
    time greater than the current virtual time.
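
    In code, eligibility is a comparison against the service tree's
    virtual time; a minimal illustration (the helper name is hypothetical;
    bfq_gt() is bfq's wrap-safe "greater than"):

    /* An entity is eligible for service iff its virtual start time does
     * not exceed the tree's current virtual time. */
    static bool bfq_entity_is_eligible(struct bfq_entity *entity,
                                       struct bfq_service_tree *st)
    {
            return !bfq_gt(entity->start, st->vtime);
    }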

    Signed-off-by: Hou Tao
    Reviewed-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Hou Tao
  • Pull more block updates from Jens Axboe:
    "This is a followup for block changes, that didn't make the initial
    pull request. It's a bit of a mixed bag, this contains:

    - A followup pull request from Sagi for NVMe. Outside of fixups for
    NVMe, it also includes a series for ensuring that we properly
    quiesce hardware queues when browsing live tags.

    - Set of integrity fixes from Dmitry (mostly), fixing various issues
    for folks using DIF/DIX.

    - Fix for a bug introduced in cciss, with the req init changes. From
    Christoph.

    - Fix for a bug in BFQ, from Paolo.

    - Two followup fixes for lightnvm/pblk from Javier.

    - Depth fix from Ming for blk-mq-sched.

    - Also from Ming, performance fix for mtip32xx that was introduced
    with the dynamic initialization of commands"

    * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
    block: call bio_uninit in bio_endio
    nvmet: avoid unneeded assignment of submit_bio return value
    nvme-pci: add module parameter for io queue depth
    nvme-pci: compile warnings in nvme_alloc_host_mem()
    nvmet_fc: Accept variable pad lengths on Create Association LS
    nvme_fc/nvmet_fc: revise Create Association descriptor length
    lightnvm: pblk: remove unnecessary checks
    lightnvm: pblk: control I/O flow also on tear down
    cciss: initialize struct scsi_req
    null_blk: fix error flow for shared tags during module_init
    block: Fix __blkdev_issue_zeroout loop
    nvme-rdma: unconditionally recycle the request mr
    nvme: split nvme_uninit_ctrl into stop and uninit
    virtio_blk: quiesce/unquiesce live IO when entering PM states
    mtip32xx: quiesce request queues to make sure no submissions are inflight
    nbd: quiesce request queues to make sure no submissions are inflight
    nvme: kick requeue list when requeueing a request instead of when starting the queues
    nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
    ...

    Linus Torvalds

11 Jul, 2017

1 commit

  • bio_free isn't a good place to free cgroup info. There are a lot of
    cases where a bio is allocated in a special way (for example, on the
    stack) and bio_put, and hence bio_free, is never called on it, so we
    leak memory. This patch moves the free to bio_endio, which should be
    called anyway. The bio_uninit call in bio_free is kept, in case
    bio_endio is never called on the bio.

    This assumes ->bi_end_io() doesn't access cgroup info, which seems
    true in my audit.

    This along with Christoph's integrity patch should fix the memory leak
    issue.
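
    Concretely, the release moves to the tail of bio_endio(), right before
    the completion callback; sketched (abridged):

    void bio_endio(struct bio *bio)
    {
            /* ... bio chaining and integrity handling ... */

            /* On-stack bios may never see bio_put()/bio_free(), so drop
             * cgroup info (and other uninit work) before ->bi_end_io(). */
            bio_uninit(bio);

            if (bio->bi_end_io)
                    bio->bi_end_io(bio);
    }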

    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li

07 Jul, 2017

1 commit

  • Pull misc compat stuff updates from Al Viro:
    "This part is basically untangling various compat stuff. Compat
    syscalls moved to their native counterparts, getting rid of quite a
    bit of double-copying and/or set_fs() uses. A lot of field-by-field
    copyin/copyout killed off.

    - kernel/compat.c is much closer to containing just the
    copyin/copyout of compat structs. Not all compat syscalls are gone
    from it yet, but it's getting there.

    - ipc/compat_mq.c killed off completely.

    - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
    drivers/block/floppy.c where they belong. Yes, there are several
    drivers that implement some of the same ioctls. Some are m68k and
    one is 32bit-only pmac. drivers/block/floppy.c is the only one in
    that bunch that can be built on biarch"

    * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mqueue: move compat syscalls to native ones
    usbdevfs: get rid of field-by-field copyin
    compat_hdio_ioctl: get rid of set_fs()
    take floppy compat ioctls to sodding floppy.c
    ipmi: get rid of field-by-field __get_user()
    ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
    rt_sigtimedwait(): move compat to native
    select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
    put_compat_rusage(): switch to copy_to_user()
    sigpending(): move compat to native
    getrlimit()/setrlimit(): move compat to native
    times(2): move compat to native
    compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
    fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
    do_sigaltstack(): lift copying to/from userland into callers
    take compat_sys_old_getrlimit() to native syscall
    trim __ARCH_WANT_SYS_OLD_GETRLIMIT

    Linus Torvalds

06 Jul, 2017

1 commit

  • The BIO issuing loop in __blkdev_issue_zeroout() is allocating BIOs
    with a maximum number of bvecs (pages) equal to

    min(nr_sects, (sector_t)BIO_MAX_PAGES)

    This works, since the requested number of bvecs will always be limited
    to the absolute maximum number supported (BIO_MAX_PAGES), but it is
    inefficient, as too many bvec entries may be requested due to the
    different units being used in the min() operation (number of sectors
    vs number of pages).
    To fix this, introduce the helper __blkdev_sectors_to_bio_pages() to
    correctly calculate the number of bvecs for zeroout BIOs as the
    issuing loop progresses. The calculation is done using consistent
    units and makes sure that the number of pages returned is at least 1
    (for cases where the number of sectors is less than the number of
    sectors in a page).

    Also remove a trailing space after the bit shift in the internal loop
    min() call.
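
    A sketch of such a helper, doing the conversion in consistent units
    (512-byte sectors to pages) and clamping to BIO_MAX_PAGES:

    /* Rounding up guarantees at least one page for small ranges. */
    static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects)
    {
            sector_t pages = DIV_ROUND_UP_SECTOR_T(nr_sects,
                                                   PAGE_SIZE / 512);

            return min(pages, (sector_t)BIO_MAX_PAGES);
    }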

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal

05 Jul, 2017

1 commit

  • block/bio-integrity.c:318:10-11: WARNING: return of 0/1 in function 'bio_integrity_prep' with return type bool

    Return statements in functions returning bool should use
    true/false instead of 1/0.
    Generated by: scripts/coccinelle/misc/boolreturn.cocci

    Fixes: e23947bd76f0 ("bio-integrity: fold bio_integrity_enabled to bio_integrity_prep")
    CC: Dmitry Monakhov
    Signed-off-by: Fengguang Wu
    Signed-off-by: Jens Axboe

    kbuild test robot

04 Jul, 2017

13 commits

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside of consolidating code this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and re-enable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: This contains also the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devmem (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
  • And instead call directly into the integrity code from bio_end_io.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • Currently ->verify_fn does not work at all, because at the moment it
    is called bio->bi_iter.bi_size == 0, so we do not iterate over the
    integrity bvecs at all.

    In order to perform verification we need to know the original data
    vector; with the new bvec rewind API this is trivial.

    testcase: https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dmitry Monakhov
    [hch: adopted for new status values]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
  • Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
  • Currently all integrity prep hooks are open-coded, and if prepare
    fails we ignore its error code and fail the bio with EIO. Let's return
    the real error to the upper layer, so the caller may react
    accordingly.

    In fact no one wants to use bio_integrity_prep() without
    bio_integrity_enabled, so it is reasonable to fold them into one
    function.
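
    After the fold, callers check a single boolean and stop processing on
    failure; a sketch of the new calling convention (based on the
    description above):

            /* bio_integrity_prep() now decides internally whether the
             * device needs integrity metadata, and on allocation failure
             * ends the bio itself with the real error. */
            if (unlikely(!bio_integrity_prep(bio)))
                    return BLK_QC_T_NONE;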

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    [hch: merged with the latest block tree,
    return bool from bio_integrity_prep]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
  • bio_integrity_trim inherited its interface from bio_trim and accepted
    an offset and size, but this API is error prone because the data
    offset must always be in sync with the bio's data offset. That is why
    we have the integrity update hook in bio_advance().

    So the only meaningful values are offset == 0 and
    sectors == bio_sectors(bio); let's just remove the two arguments
    completely.
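
    With the arguments gone, trimming reduces to clamping the integrity
    iterator to the bio's current size; a sketch (illustrative):

    void bio_integrity_trim(struct bio *bio)
    {
            struct bio_integrity_payload *bip = bio_integrity(bio);
            struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);

            /* offset is implicitly 0; size is bio_sectors(bio) */
            bip->bip_iter.bi_size = bio_integrity_bytes(bi,
                                                        bio_sectors(bio));
    }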

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
  • SCSI drivers do care about bip_seed, so we must update it accordingly.
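
    The natural place for the update is bio_integrity_advance(), next to
    the existing iterator advance; sketched (abridged):

            /* Keep the protection-information seed in sync with the data
             * offset as the bio advances (512-byte sectors). */
            bip->bip_iter.bi_sector += bytes_done >> 9;
            bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);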

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
  • Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
  • When mq-deadline is used, sequential-read and sequential-write IOPS
    are observed to drop by more than 20% on SATA (scsi-mq) devices,
    compared with the 'none' scheduler.

    The reason is that the default nr_requests for the scheduler is too
    big for small-queue-depth devices, and latency increases a lot.

    Since the principle of using 256 requests for the mq scheduler is
    based on a queue depth of 128, this patch changes the default to
    double the value of min(hw queue_depth, 128).
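
    Expressed in code, the scheduler's default depth becomes (a sketch;
    BLKDEV_MAX_RQ is 128):

            /* Scale the scheduler's request pool with the hardware queue
             * depth instead of a flat 2 * 128 = 256 requests. */
            q->nr_requests = 2 * min_t(unsigned int,
                                       q->tag_set->queue_depth,
                                       BLKDEV_MAX_RQ);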

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
  • On each deactivation or re-scheduling (after being served) of a
    bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
    to perform pending updates of ioprio, weight and ioprio class for the
    bfq_queue. BFQ also invokes this function on I/O-request dispatches,
    to raise or lower weights more quickly when needed, thereby improving
    latency. However, the entity representing the bfq_queue may be on the
    active (sub)tree of a service tree when this happens, and, although
    with a very low probability, the bfq_queue may happen to also have a
    pending change of its ioprio class. If both conditions hold when
    __bfq_entity_update_weight_prio() is invoked, then the entity moves to
    a sort of hybrid state: the new service tree for the entity, as
    returned by bfq_entity_service_tree(), differs from the service tree
    on which the entity still is. The functions that handle activations and
    deactivations of entities do not cope with such a hybrid state (and
    would need to become more complex to cope).

    This commit addresses this issue by simply making
    __bfq_entity_update_weight_prio() not also perform a possible pending
    change of ioprio class, when invoked on an I/O-request dispatch for a
    bfq_queue. Such a change is thus postponed to when
    __bfq_entity_update_weight_prio() is invoked on deactivation or
    re-scheduling of the bfq_queue.

    Reported-by: Marco Piazza
    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler
    code history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particular around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
  • Pull uuid subsystem from Christoph Hellwig:
    "This is the new uuid subsystem, in which Amir, Andy and I have started
    consolidating our uuid/guid helpers and improving the types used for
    them. Note that various other subsystems have pulled in this tree, so
    I'd like it to go in early.

    UUID/GUID summary:

    - introduce the new uuid_t/guid_t types that are going to replace the
    somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)

    - consolidated generic uuid/guid helper functions lifted from XFS and
    libnvdimm (Amir and me)

    - conversions to the new types and helpers (Amir, Andy and me)"

    * tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
    ACPI: hns_dsaf_acpi_dsm_guid can be static
    mmc: sdhci-pci: make guid intel_dsm_guid static
    uuid: Take const on input of uuid_is_null() and guid_is_null()
    thermal: int340x_thermal: fix compile after the UUID API switch
    thermal: int340x_thermal: Switch to use new generic UUID API
    acpi: always include uuid.h
    ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
    ACPI / extlog: Switch to use new generic UUID API
    ACPI / bus: Switch to use new generic UUID API
    ACPI / APEI: Switch to use new generic UUID API
    acpi, nfit: Switch to use new generic UUID API
    MAINTAINERS: add uuid entry
    tmpfs: generate random sb->s_uuid
    scsi_debug: switch to uuid_t
    nvme: switch to uuid_t
    sysctl: switch to use uuid_t
    partitions/ldm: switch to use uuid_t
    overlayfs: use uuid_t instead of uuid_be
    fs: switch ->s_uuid to uuid_t
    ima/policy: switch to use uuid_t
    ...

    Linus Torvalds

29 Jun, 2017

4 commits

  • This patch performs sequential mapping between CPUs and queues. In
    case the system has more CPUs than HWQs, there are still CPUs left to
    map to HWQs. On a hyperthreaded system, map the unmapped CPUs and
    their siblings to the same HWQ.

    This actually fixes a bug where unmapped HWQs were found on a system
    with 2 sockets, 18 cores per socket and 2 threads per core (72 CPUs in
    total) running NVMEoF (which opens up to a maximum of 64 HWQs).

    Performance results running fio (72 jobs, 128 iodepth)
    using null_blk (w/w.o patch):

    bs IOPS(read submit_queues=72) IOPS(write submit_queues=72) IOPS(read submit_queues=24) IOPS(write submit_queues=24)
    ----- ---------------------------- ------------------------------ ---------------------------- -----------------------------
    512 4890.4K/4723.5K 4524.7K/4324.2K 4280.2K/4264.3K 3902.4K/3909.5K
    1k 4910.1K/4715.2K 4535.8K/4309.6K 4296.7K/4269.1K 3906.8K/3914.9K
    2k 4906.3K/4739.7K 4526.7K/4330.6K 4301.1K/4262.4K 3890.8K/3900.1K
    4k 4918.6K/4730.7K 4556.1K/4343.6K 4297.6K/4264.5K 3886.9K/3893.9K
    8k 4906.4K/4748.9K 4550.9K/4346.7K 4283.2K/4268.8K 3863.4K/3858.2K
    16k 4903.8K/4782.6K 4501.5K/4233.9K 4292.3K/4282.3K 3773.1K/3773.5K
    32k 4885.8K/4782.4K 4365.9K/4184.2K 4307.5K/4289.4K 3780.3K/3687.3K
    64k 4822.5K/4762.7K 2752.8K/2675.1K 4308.8K/4312.3K 2651.5K/2655.7K
    128k 2388.5K/2313.8K 1391.9K/1375.7K 2142.8K/2152.2K 1395.5K/1374.2K

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Max Gurtovoy
  • Wen reports significant memory leaks with DIF and O_DIRECT:

    "With nvme devive + T10 enabled, On a system it has 256GB and started
    logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
    it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
    leaking.

    /proc/meminfo | grep SUnreclaim...

    SUnreclaim: 6752128 kB
    SUnreclaim: 6874880 kB
    SUnreclaim: 7238080 kB
    ....
    SUnreclaim: 22307264 kB
    SUnreclaim: 22485888 kB
    SUnreclaim: 22720256 kB

    When testcases with T10 enabled call into __blkdev_direct_IO_simple,
    the code doesn't free memory allocated by bio_integrity_alloc. The
    patch fixes the issue. HTX has been run for 60+ hours without failure."

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
    doesn't go through the regular bio free. This means that any ancillary
    data allocated with the bio through the stack is not freed. Hence, we
    can leak the integrity data associated with the bio, if the device is
    using DIF/DIX.

    Fix this by providing a bio_uninit() and exporting it, so that we can
    use it to free this data. Note that this is a minimal fix for this
    issue. Any current user of bios that are allocated outside of
    bio_alloc_bioset() suffers from this issue, most notably some drivers.
    We will fix those in a more comprehensive patch for 4.13. This also
    means that the commit marked as being fixed by this isn't the real
    culprit, it's just the most obvious one out there.
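
    For the __blkdev_direct_IO_simple() path, this means tearing down the
    on-stack bio explicitly once I/O has completed; sketched (abridged):

            struct bio bio;

            bio_init(&bio, vecs, nr_pages);
            /* ... submit the bio and wait for completion ... */

            /* bio_put() is never called for an on-stack bio, so release
             * ancillary data (e.g. the integrity payload) by hand. */
            bio_uninit(&bio);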

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
  • This way we get a nice distribution independent of the current cpu
    online / offline state.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig

28 Jun, 2017

1 commit

  • This commit fixes a bug triggered by a non-trivial sequence of
    events. These events are briefly described in the next two
    paragraphs. The impatient, or those who are familiar with queue
    merging and splitting, can jump directly to the last paragraph.

    On each I/O-request arrival for a shared bfq_queue, i.e., for a
    bfq_queue that is the result of the merge of two or more bfq_queues,
    BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
    many random I/O requests have arrived for the bfq_queue; if the device
    is non rotational, then random requests must be also small for the
    bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
    detected as seeky, then a split occurs: the bfq I/O context of the
    process that has issued the request is redirected from the shared
    bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
    shared bfq_queue actually happens to be shared only by one process
    (because of previous splits), then no new bfq_queue is created: the
    state of the shared bfq_queue is just changed from shared to non
    shared.

    Regardless of whether a brand new non-shared bfq_queue is created, or
    the pre-existing shared bfq_queue is just turned into a non-shared
    bfq_queue, several parameters of the non-shared bfq_queue are set
    (restored) to the original values they had when the bfq_queue
    associated with the bfq I/O context of the process (that has just
    issued an I/O request) was merged with the shared bfq_queue. One of
    these parameters is the weight-raising state.

    If, on the split of a shared bfq_queue,
    1) a pre-existing shared bfq_queue is turned into a non-shared
    bfq_queue;
    2) the previously shared bfq_queue happens to be busy;
    3) the weight-raising state of the previously shared bfq_queue happens
    to change;
    the number of weight-raised busy queues changes. The field
    wr_busy_queues must then be updated accordingly, but such an update
    was missing. This commit adds the missing update.

    Reported-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente