04 Jul, 2017

4 commits

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside from consolidating code, this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and re-enable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: this also contains the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devres (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler
    code history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     
  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particular around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"
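
    As a reference for the first item above, a sketch of the new status
    code type (the full set of codes lives in include/linux/blk_types.h):

        /* sketch: one self-documenting status type for the whole block layer */
        typedef u8 __bitwise blk_status_t;

        #define BLK_STS_OK      0
        #define BLK_STS_NOTSUPP ((__force blk_status_t)1)
        #define BLK_STS_TIMEOUT ((__force blk_status_t)2)
        #define BLK_STS_NOSPC   ((__force blk_status_t)3)
        /* ... plus BLK_STS_IOERR, BLK_STS_RESOURCE, etc. */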

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     
  • Pull uuid subsystem from Christoph Hellwig:
    "This is the new uuid subsystem, in which Amir, Andy and I have started
    consolidating our uuid/guid helpers and improving the types used for
    them. Note that various other subsystems have pulled in this tree, so
    I'd like it to go in early.

    UUID/GUID summary:

    - introduce the new uuid_t/guid_t types that are going to replace the
    somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)

    - consolidated generic uuid/guid helper functions lifted from XFS and
    libnvdimm (Amir and me)

    - conversions to the new types and helpers (Amir, Andy and me)"
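
    For reference, a sketch of the new types and a few of the consolidated
    helpers (see include/linux/uuid.h for the full API):

        typedef struct {
                __u8 b[16];
        } guid_t;       /* little-endian; what used to be uuid_le */

        typedef struct {
                __u8 b[16];
        } uuid_t;       /* big-endian; RFC 4122 / libuuid style */

        /* helpers (mostly inline in the header) */
        bool uuid_equal(const uuid_t *u1, const uuid_t *u2);
        bool uuid_is_null(const uuid_t *uuid);
        void uuid_copy(uuid_t *dst, const uuid_t *src);
        int uuid_parse(const char *str, uuid_t *u);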

    * tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
    ACPI: hns_dsaf_acpi_dsm_guid can be static
    mmc: sdhci-pci: make guid intel_dsm_guid static
    uuid: Take const on input of uuid_is_null() and guid_is_null()
    thermal: int340x_thermal: fix compile after the UUID API switch
    thermal: int340x_thermal: Switch to use new generic UUID API
    acpi: always include uuid.h
    ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
    ACPI / extlog: Switch to use new generic UUID API
    ACPI / bus: Switch to use new generic UUID API
    ACPI / APEI: Switch to use new generic UUID API
    acpi, nfit: Switch to use new generic UUID API
    MAINTAINERS: add uuid entry
    tmpfs: generate random sb->s_uuid
    scsi_debug: switch to uuid_t
    nvme: switch to uuid_t
    sysctl: switch to use uuid_t
    partitions/ldm: switch to use uuid_t
    overlayfs: use uuid_t instead of uuid_be
    fs: switch ->s_uuid to uuid_t
    ima/policy: switch to use uuid_t
    ...

    Linus Torvalds
     

29 Jun, 2017

4 commits

    This patch performs sequential mapping between CPUs and queues. In
    case the system has more CPUs than HWQs, there are still CPUs left to
    map to HWQs. On a hyperthreaded system, map those unmapped CPUs and
    their siblings to the same HWQ.
    This actually fixes a bug where unmapped HWQs were found on a system
    with 2 sockets, 18 cores per socket and 2 threads per core (72 CPUs in
    total) running NVMEoF (which opens up to a maximum of 64 HWQs).
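
    A minimal sketch of the resulting mapping logic, simplified from
    block/blk-mq-cpumap.c (helper names as in that file):

        for_each_possible_cpu(cpu) {
                if (cpu < nr_queues) {
                        /* first, sequential mapping between CPUs and queues */
                        map[cpu] = cpu_to_queue_index(nr_queues, cpu);
                } else {
                        /*
                         * Out of queues: siblings share the HWQ of their
                         * first sibling; a CPU that is its own first
                         * sibling falls back to the wrap-around index.
                         */
                        first_sibling = get_first_sibling(cpu);
                        if (first_sibling == cpu)
                                map[cpu] = cpu_to_queue_index(nr_queues, cpu);
                        else
                                map[cpu] = map[first_sibling];
                }
        }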

    Performance results from running fio (72 jobs, iodepth 128)
    against null_blk (with/without the patch):

    bs     IOPS(read, sq=72)   IOPS(write, sq=72)   IOPS(read, sq=24)   IOPS(write, sq=24)
    -----  ------------------  -------------------  ------------------  -------------------
    512    4890.4K/4723.5K     4524.7K/4324.2K      4280.2K/4264.3K     3902.4K/3909.5K
    1k     4910.1K/4715.2K     4535.8K/4309.6K      4296.7K/4269.1K     3906.8K/3914.9K
    2k     4906.3K/4739.7K     4526.7K/4330.6K      4301.1K/4262.4K     3890.8K/3900.1K
    4k     4918.6K/4730.7K     4556.1K/4343.6K      4297.6K/4264.5K     3886.9K/3893.9K
    8k     4906.4K/4748.9K     4550.9K/4346.7K      4283.2K/4268.8K     3863.4K/3858.2K
    16k    4903.8K/4782.6K     4501.5K/4233.9K      4292.3K/4282.3K     3773.1K/3773.5K
    32k    4885.8K/4782.4K     4365.9K/4184.2K      4307.5K/4289.4K     3780.3K/3687.3K
    64k    4822.5K/4762.7K     2752.8K/2675.1K      4308.8K/4312.3K     2651.5K/2655.7K
    128k   2388.5K/2313.8K     1391.9K/1375.7K      2142.8K/2152.2K     1395.5K/1374.2K

    (sq = submit_queues)

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • Wen reports significant memory leaks with DIF and O_DIRECT:

    "With nvme devive + T10 enabled, On a system it has 256GB and started
    logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
    it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
    leaking.

    /proc/meminfo | grep SUnreclaim...

    SUnreclaim: 6752128 kB
    SUnreclaim: 6874880 kB
    SUnreclaim: 7238080 kB
    ....
    SUnreclaim: 22307264 kB
    SUnreclaim: 22485888 kB
    SUnreclaim: 22720256 kB

    When testcases with T10 enabled call into __blkdev_direct_IO_simple,
    the code doesn't free the memory allocated by bio_integrity_alloc. The
    patch fixes the issue. HTX has run for 60+ hours without failure."

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
    doesn't go through the regular bio free. This means that any ancillary
    data allocated with the bio through the stack is not freed. Hence, we
    can leak the integrity data associated with the bio, if the device is
    using DIF/DIX.

    Fix this by providing a bio_uninit() and export it, so that we can use
    it to free this data. Note that this is a minimal fix for this issue.
    Any current user of bios that are allocated outside of
    bio_alloc_bioset() suffers from this issue, most notably some drivers.
    We will fix those in a more comprehensive patch for 4.13. This also
    means that the commit marked as being fixed by this isn't the real
    culprit, it's just the most obvious one out there.
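
    A simplified sketch of the resulting pattern in
    __blkdev_direct_IO_simple():

        struct bio bio;         /* on-stack bio: bio_put() is never called */

        bio_init(&bio, vecs, nr_pages);
        /* ... fill in the bio, submit it and wait for completion ... */

        /* tear down ancillary data (e.g. the integrity payload) by hand */
        bio_uninit(&bio);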

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     
  • This way we get a nice distribution independent of the current cpu
    online / offline state.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     

28 Jun, 2017

10 commits

    This commit fixes a bug triggered by a non-trivial sequence of
    events. These events are briefly described in the next two
    paragraphs. The impatient, or those who are familiar with queue
    merging and splitting, can jump directly to the last paragraph.

    On each I/O-request arrival for a shared bfq_queue, i.e., for a
    bfq_queue that is the result of the merge of two or more bfq_queues,
    BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
    many random I/O requests have arrived for the bfq_queue; if the device
    is non-rotational, then random requests must also be small for the
    bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
    detected as seeky, then a split occurs: the bfq I/O context of the
    process that has issued the request is redirected from the shared
    bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
    shared bfq_queue actually happens to be shared only by one process
    (because of previous splits), then no new bfq_queue is created: the
    state of the shared bfq_queue is just changed from shared to
    non-shared.

    Regardless of whether a brand new non-shared bfq_queue is created, or
    the pre-existing shared bfq_queue is just turned into a non-shared
    bfq_queue, several parameters of the non-shared bfq_queue are set
    (restored) to the original values they had when the bfq_queue
    associated with the bfq I/O context of the process (that has just
    issued an I/O request) was merged with the shared bfq_queue. One of
    these parameters is the weight-raising state.

    If, on the split of a shared bfq_queue,
    1) a pre-existing shared bfq_queue is turned into a non-shared
    bfq_queue;
    2) the previously shared bfq_queue happens to be busy;
    3) the weight-raising state of the previously shared bfq_queue happens
    to change;
    the number of weight-raised busy queues changes. The field
    wr_busy_queues must then be updated accordingly, but such an update
    was missing. This commit adds the missing update.

    Reported-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
    Instead, move it to the callers. Callers that either don't use
    bio_data() or page_address(), or that are specific to architectures
    without highmem support, are skipped.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    And just move it into scsi_transport_sas, which needs it because
    low-level drivers directly dereference bio_data, and into
    blk_init_queue_node, which will need a further push into the callers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • For historical reasons we default to bouncing highmem pages for all block
    queues. But the blk-mq drivers are easy to audit to ensure that we don't
    need this - scsi and mtip32xx set explicit limits and everyone else doesn't
    have any particular ones.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We only call blk_queue_bounce for request-based drivers, so stop messing
    with it for make_request based drivers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Only used inside the bounce code, and open-coding it makes it more
    obvious what is going on.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    This moves the knowledge about bouncing out of the callers and into
    the block core (just like we do for the normal I/O path), and allows
    us to unexport blk_queue_bounce.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Useful to verify that things are working the way they should.
    Reading the file will return the number of kB written with each
    write hint. Writing to the file will reset the statistics. No care
    is taken to ensure that we don't race on updates.

    Drivers will write to q->write_hints[] if they handle a given
    write hint.
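
    A hedged sketch of what that accounting might look like (the array
    name comes from the text above; the size constant and the exact
    accounting site are assumptions):

        /* in struct request_queue: per-hint counters, in kB */
        u64 write_hints[BLK_MAX_WRITE_HINTS];

        /* a driver that handles a given hint accounts what it wrote */
        q->write_hints[rq->write_hint] += blk_rq_bytes(rq) >> 10;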

    Reviewed-by: Andreas Dilger
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
    No functional changes in this patch; we just use up some holes
    in the bio and request structures to define a write hint that
    we pass down the stack.

    Ensure that we don't merge requests that have different life time
    hints assigned to them, and that we inherit the write hint when
    cloning a bio.
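
    A sketch of the two rules, roughly as one would expect them in the
    merge and clone paths:

        /* never merge requests with different life time hints */
        if (rq->write_hint != bio->bi_write_hint)
                return false;

        /* a cloned bio inherits the write hint of its source */
        bio->bi_write_hint = bio_src->bi_write_hint;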

    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Jun, 2017

6 commits

    The hardware context's queue_num has already been set prior to the
    call to blk_mq_init_hctx(), so there is no need to set it again.

    Signed-off-by: weiping
    Signed-off-by: Jens Axboe

    weiping
     
    Since blk_mq_quiesce_queue_nowait() can be called from interrupt
    context, make it safe to do so. Since this function is not in the
    hot path, uninline it.

    Fixes: commit f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
    Signed-off-by: Bart Van Assche
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Cc: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This was detected by the smatch static analyzer.

    Fixes: commit 2a842acab109 ("block: introduce new block status code type")
    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
    Avoid that building with W=1 causes the compiler to complain that
    declarations for bounce_bio_set and bounce_bio_split are missing.

    References: commit a8821f3f32be ("block: Improvements to bounce-buffer handling")
    Signed-off-by: Bart Van Assche
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch suppresses gcc 7 warnings about falling through in switch
    statements when building with W=1. From the gcc documentation: The
    -Wimplicit-fallthrough=3 warning is enabled by -Wextra. See also
    https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Warning-Options.html.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • If we have shared tags enabled, then every IO completion will trigger
    a full loop of every queue belonging to a tag set, and every hardware
    queue for each of those queues, even if nothing needs to be done.
    This causes a massive performance regression if you have a lot of
    shared devices.

    Instead of doing this huge full scan on every IO, add an atomic
    counter to the main queue that tracks how many hardware queues have
    been marked as needing a restart. With that, we can avoid looking for
    restartable queues, if we don't have to.
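
    A sketch of the counter (field name as in the patch; surrounding
    code simplified):

        /* in struct request_queue: hctxs currently marked for restart */
        atomic_t shared_hctx_restart;

        /* marking a hardware queue bumps the counter exactly once */
        if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                atomic_inc(&hctx->queue->shared_hctx_restart);

        /* on completion, skip the full scan when nothing is marked */
        if (!atomic_read(&queue->shared_hctx_restart))
                return;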

    Max reports that this restores performance. Before this patch, 4k
    IOPS were limited to 22-23K IOPS. With the patch, we are running at
    950-970K IOPS.

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy
    Tested-by: Max Gurtovoy
    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jun, 2017

11 commits

  • A queue must be frozen while the mapped state of a hardware queue
    is changed. Additionally, any change of the mapped state is
    followed by a call to blk_mq_map_swqueue() (see also
    blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues()).
    Since blk_mq_map_swqueue() does not map any unmapped hardware
    queue onto any software queue, no attempt will be made to run
    an unmapped hardware queue. Hence issue a warning upon attempts
    to run an unmapped hardware queue.
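
    The check boils down to something like this at the top of the
    queue-run path (illustrative placement):

        /* true iff at least one software queue maps to this hctx */
        WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx));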

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • The variable 'disk_type' is never modified so constify it.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Document the locking assumptions in functions that modify
    blk_mq_ctx.rq_list to make it easier for humans to verify
    this code.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Some functions in block/blk-core.c must only be used on blk-sq queues
    while others are safe to use against any queue type. Document which
    functions are intended for blk-sq queues and issue a warning if the
    blk-sq API is misused. This not only helps block driver authors, but
    will also make it easier to remove the blk-sq code once that code is
    declared obsolete.
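
    The warnings follow a simple pattern; a sketch for one of the
    legacy-only entry points (the exact set of functions covered varies):

        void blk_run_queue(struct request_queue *q)
        {
                WARN_ON_ONCE(q->mq_ops);        /* blk-sq only */
                /* ... */
        }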

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of documenting the locking assumptions of most block layer
    functions as a comment, use lockdep_assert_held() to verify locking
    assumptions at runtime.
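
    So instead of a "must be called with the queue lock held" comment,
    the pattern becomes (sketch):

        void __blk_run_queue(struct request_queue *q)
        {
                lockdep_assert_held(q->queue_lock);
                /* ... */
        }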

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init()
    is called after a value has been assigned to .rq_flags and .rq_flags
    is initialized in __blk_mq_finish_request(). Initialize .rq_flags in
    blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request().
    Moving the initialization of .rq_flags is fine because all changes
    and tests of .rq_flags occur between blk_get_request() and finishing
    a request.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since scsi_req_init() works on a struct scsi_request, change the
    argument type into struct scsi_request *.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of explicitly calling scsi_req_init() after blk_get_request(),
    call that function from inside blk_get_request(). Add an
    .initialize_rq_fn() callback function to the block drivers that need
    it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn()
    because it is too small to keep it as a separate function. Keep the
    scsi_req_init() call in ide_prep_sense() because it follows a
    blk_rq_init() call.

    References: commit 82ed4db499b8 ("block: split scsi_request out of struct request")
    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Nicholas Bellinger
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Several block drivers need to initialize the driver-private request
    data after having called blk_get_request() and before .prep_rq_fn()
    is called, e.g. when submitting a REQ_OP_SCSI_* request. Avoid having
    to repeat that initialization code after every blk_get_request() call
    by adding new callback functions to struct request_queue and to
    struct blk_mq_ops.
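
    A simplified sketch of how the callbacks are wired into
    blk_get_request():

        struct request *blk_get_request(struct request_queue *q,
                                        unsigned int op, gfp_t gfp_mask)
        {
                struct request *req;

                if (q->mq_ops) {
                        req = blk_mq_alloc_request(q, op, 0);
                        if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
                                q->mq_ops->initialize_rq_fn(req);
                } else {
                        req = blk_old_get_request(q, op, gfp_mask);
                        if (!IS_ERR(req) && q->initialize_rq_fn)
                                q->initialize_rq_fn(req);
                }

                return req;
        }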

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of declaring the second argument of blk_*_get_request()
    as int and passing it to functions that expect an unsigned int,
    declare that second argument as unsigned int. Also, for consistency,
    rename that second argument from 'rw' to 'op'.
    This patch does not change any functionality.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since the srcu structure is rather large (184 bytes on an x86-64
    system with kernel debugging disabled), only allocate it if needed.
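
    A hedged sketch of the approach: the srcu structure is appended to
    the hardware context allocation only for BLK_MQ_F_BLOCKING drivers
    (function name approximate):

        static size_t blk_mq_hw_ctx_size(struct blk_mq_tag_set *set)
        {
                size_t size = sizeof(struct blk_mq_hw_ctx);

                /* only BLK_MQ_F_BLOCKING drivers need the srcu structure */
                if (set->flags & BLK_MQ_F_BLOCKING)
                        size += sizeof(struct srcu_struct);

                return size;
        }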

    Reported-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Jun, 2017

3 commits

    A new bio operation flag, REQ_NOWAIT, is introduced to identify bios
    originating from an iocb with IOCB_NOWAIT set. This flag indicates
    that we should return immediately if a request cannot be made,
    instead of retrying.

    Stacked devices such as md (the ones with make_request_fn hooks) are
    currently not supported because they may block for housekeeping. For
    example, part of an md device can be suspended. For this reason, only
    request-based devices are supported. In the future, this feature will
    be extended to stacked devices by teaching them how to handle the
    REQ_NOWAIT flag.
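
    A hedged sketch of the intended flow (helper names as in the series,
    slightly simplified):

        /* submitter: propagate IOCB_NOWAIT to the bio */
        if (iocb->ki_flags & IOCB_NOWAIT)
                bio->bi_opf |= REQ_NOWAIT;

        /* block layer: fail fast instead of sleeping for a request */
        if (bio->bi_opf & REQ_NOWAIT) {
                bio_wouldblock_error(bio);      /* completes with -EAGAIN */
                return;
        }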

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
     
  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... it was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {
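
    After the rename, the two structures look like this (sketch of
    include/linux/wait.h):

        struct wait_queue_entry {
                unsigned int            flags;
                void                    *private;
                wait_queue_func_t       func;
                struct list_head        entry;          /* was: task_list */
        };

        struct wait_queue_head {
                spinlock_t              lock;
                struct list_head        head;           /* was: task_list */
        };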

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.
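
    The old name is kept as a typedef, so existing users keep building:

        typedef struct wait_queue_entry wait_queue_entry_t;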

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar