14 Dec, 2014

1 commit

  • Pull block driver core update from Jens Axboe:
    "This is the pull request for the core block IO changes for 3.19. Not
    a huge round this time, mostly lots of little good fixes:

    - Fix a bug in sysfs blktrace interface causing a NULL pointer
    dereference, when enabled/disabled through that API. From Arianna
    Avanzini.

    - Various updates/fixes/improvements for blk-mq:

    - A set of updates from Bart, mostly fixing bugs in the tag
    handling.

    - Cleanup/code consolidation from Christoph.

    - Extend queue_rq API to be able to handle batching issues of IO
    requests. NVMe will utilize this shortly. From me.

    - A few tag and request handling updates from me.

    - Cleanup of the preempt handling for running queues from Paolo.

    - Prevent running of unmapped hardware queues from Ming Lei.

    - Move the kdump memory limiting check to be in the correct
    location, from Shaohua.

    - Initialize all software queues at init time from Takashi. This
    prevents a kobject warning when CPUs are brought online that
    weren't online when a queue was registered.

    - Single writeback fix for I_DIRTY clearing from Tejun. Queued with
    the core IO changes, since it's just a single fix.

    - Version X of the __bio_add_page() segment addition retry from
    Maurizio. Hope the Xth time is the charm.

    - Documentation fixup for IO scheduler merging from Jan.

    - Introduce (and use) generic IO stat accounting helpers for non-rq
    drivers, from Gu Zheng.

    - Kill off artificial limiting of max sectors in a request from
    Christoph"

    * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    blk-mq: Fix uninitialized kobject at CPU hotplugging
    blktrace: don't let the sysfs interface remove trace from running list
    blk-mq: Use all available hardware queues
    blk-mq: Micro-optimize bt_get()
    blk-mq: Fix a race between bt_clear_tag() and bt_get()
    blk-mq: Avoid that __bt_get_word() wraps multiple times
    blk-mq: Fix a use-after-free
    blk-mq: prevent unmapped hw queue from being scheduled
    blk-mq: re-check for available tags after running the hardware queue
    blk-mq: fix hang in bt_get()
    blk-mq: move the kdump check to blk_mq_alloc_tag_set
    blk-mq: cleanup tag free handling
    blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq cpu map
    blk: introduce generic io stat accounting help function
    blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
    genhd: check for int overflow in disk_expand_part_tbl()
    blk-mq: add blk_mq_free_hctx_request()
    blk-mq: export blk_mq_free_request()
    blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
    ...

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • Pull ACPI and power management updates from Rafael Wysocki:
    "This time we have some more new material than we used to have during
    the last couple of development cycles.

    The most important part of it to me is the introduction of a unified
    interface for accessing device properties provided by platform
    firmware. It works with Device Trees and ACPI in a uniform way and
    drivers using it need not worry about where the properties come from
    as long as the platform firmware (either DT or ACPI) makes them
    available. It covers both devices and "bare" device node objects
    without struct device representation as that turns out to be necessary
    in some cases. This has been in the works for quite a few months (and
    development cycles) and has been approved by all of the relevant
    maintainers.

    On top of that, some drivers are switched over to the new interface
    (at25, leds-gpio, gpio_keys_polled) and some additional changes are
    made to the core GPIO subsystem to allow device drivers to manipulate
    GPIOs in the "canonical" way on platforms that provide GPIO
    information in their ACPI tables, but don't assign names to GPIO lines
    (in which case the driver needs to do that on the basis of what it
    knows about the device in question). That also has been approved by
    the GPIO core maintainers and the rfkill driver is now going to use
    it.

    Second is support for hardware P-states in the intel_pstate driver.
    It uses CPUID to detect whether or not the feature is supported by the
    processor in which case it will be enabled by default. However, it
    can be disabled entirely from the kernel command line if necessary.

    Next is support for a platform firmware interface based on ACPI
    operation regions used by the PMIC (Power Management Integrated
    Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms.
    That interface is used for manipulating power resources and for
    thermal management: sensor temperature reporting, trip point setting
    and so on.

    Also the ACPI core is now going to support the _DEP configuration
    information in a limited way. Basically, _DEP is supposed to reflect
    off-the-hierarchy dependencies between devices which may be very
    indirect, like when AML for one device accesses locations in an
    operation region handled by another device's driver (usually, the
    device depended on this way is a serial bus or GPIO controller). The
    support added this time is sufficient to make the ACPI battery driver
    work on Asus T100A, but it is general enough to be able to cover some
    other use cases in the future.

    Finally, we have a new cpufreq driver for the Loongson1B processor.

    In addition to the above, there are fixes and cleanups all over the
    place as usual and a traditional ACPICA update to a recent upstream
    release.

    As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for
    Intel platforms should be able to handle power management of the DMA
    engine correctly, the cpufreq-dt driver should interact with the
    thermal subsystem in a better way and the ACPI backlight driver should
    handle some more corner cases, among other things.

    On top of the ACPICA update there are fixes for race conditions in the
    ACPICA's interrupt handling code which might lead to some random and
    strange looking failures on some systems.

    In the cleanups department the most visible part is the series of
    commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration
    option. That was triggered by a discussion regarding the generic
    power domains code during which we realized that trying to support
    certain combinations of PM config options was painful and not really
    worth it, because nobody would use them in production anyway. For
    this reason, we decided to make CONFIG_PM_SLEEP select
    CONFIG_PM_RUNTIME and that led to the conclusion that the latter
    became redundant and CONFIG_PM could be used instead of it. The
    material here makes that replacement in a major part of the tree, but
    there will be at least one more batch of that in the second part of
    the merge window.

    Specifics:

    - Support for retrieving device properties information from ACPI _DSD
    device configuration objects and a unified device properties
    interface for device drivers (and subsystems) on top of that. As
    stated above, this works with Device Trees and ACPI and allows
    device drivers to be written in a platform firmware (DT or ACPI)
    agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are
    now going to use this new interface and the GPIO subsystem is
    additionally modified to allow device drivers to assign names to
    GPIO resources returned by ACPI _CRS objects (in case _DSD is not
    present or does not provide the expected data). The changes in
    this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron
    Lu, and Darren Hart with some fixes from others (Fabio Estevam,
    Geert Uytterhoeven).

    - Support for Hardware Managed Performance States (HWP) as described
    in Volume 3, section 14.4, of the Intel SDM in the intel_pstate
    driver. CPUID is used to detect whether or not the feature is
    supported by the processor. If supported, it will be enabled
    automatically unless the intel_pstate=no_hwp switch is present in
    the kernel command line. From Dirk Brandewie.

    - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie).

    - Support for firmware interface based on ACPI operation regions used
    by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR
    platforms for power resource control and thermal management (Aaron
    Lu).

    - Limited support for retrieving off-the-hierarchy dependencies
    between devices from ACPI _DEP device configuration objects and
    deferred probing support for the ACPI battery driver based on the
    _DEP information to make that driver work on Asus T100A (Lan
    Tianyu).

    - New cpufreq driver for the Loongson1B processor (Kelvin Cheung).

    - ACPICA update to upstream revision 20141107 which only affects
    tools (Bob Moore).

    - Fixes for race conditions in the ACPICA's interrupt handling code
    and in the ACPI code related to system suspend and resume (Lv Zheng
    and Rafael J Wysocki).

    - ACPI core fix for an RCU-related issue in the ioremap() regions
    management code that slowed down significantly after CPUs had been
    allowed to enter idle states even if they'd had RCU callbacks
    queued and triggered some problems in certain proprietary graphics
    driver (and elsewhere). The fix replaces synchronize_rcu() in that
    code with synchronize_rcu_expedited() which makes the issue go
    away. From Konstantin Khlebnikov.

    - ACPI LPSS (Low-Power Subsystem) driver fix to handle power
    management of the DMA engine included into the LPSS correctly. The
    problem is that the DMA engine doesn't have ACPI PM support of its
    own and it simply is turned off when the last LPSS device having
    ACPI PM support goes into D3cold. To work around that, the PM
    domain used by the ACPI LPSS driver is redesigned so at least one
    device with ACPI PM support will be on as long as the DMA engine is
    in use. From Andy Shevchenko.

    - ACPI backlight driver fix to avoid using it on "Win8-compatible"
    systems where it doesn't work and where it was used by default by
    mistake (Aaron Lu).

    - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki,
    Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin
    Chaugule (mostly related to the upcoming ARM64 support).

    - Intel RAPL (Running Average Power Limit) power capping driver fixes
    and improvements including new processor IDs (Jacob Pan).

    - Generic power domains modification to power up domains after
    attaching devices to them to meet the expectations of device
    drivers and bus types assuming devices to be accessible at probe
    time (Ulf Hansson).

    - Preliminary support for controlling device clocks from the generic
    power domains core code and modifications of the ARM/shmobile
    platform to use that feature (Ulf Hansson).

    - Assorted minor fixes and cleanups of the generic power domains core
    code (Ulf Hansson, Geert Uytterhoeven).

    - Assorted minor fixes and cleanups of the device clocks control code
    in the PM core (Geert Uytterhoeven, Grygorii Strashko).

    - Consolidation of device power management Kconfig options by making
    CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter
    which is now redundant (Rafael J Wysocki and Kevin Hilman). That
    is the first batch of the changes needed for this purpose.

    - Core device runtime power management support code cleanup related
    to the execution of callbacks (Andrzej Hajda).

    - cpuidle ARM support improvements (Lorenzo Pieralisi).

    - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a
    new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and
    Bartlomiej Zolnierkiewicz).

    - New cpufreq driver callback (->ready) to be executed when the
    cpufreq core is ready to use a given policy object and cpufreq-dt
    driver modification to use that callback for cooling device
    registration (Viresh Kumar).

    - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James
    Geboski, Tomeu Vizoso).

    - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate,
    cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao,
    Stefan Wahren, Petr Cvek).

    - OPP (Operating Performance Points) framework modification to allow
    OPPs to be removed too and update of a few cpufreq drivers
    (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added
    during initialization) on driver removal (Viresh Kumar).

    - Hibernation core fixes and cleanups (Tina Ruchandani and Markus
    Elfring).

    - PM Kconfig fix related to CPU power management (Pankaj Dubey).

    - cpupower tool fix (Prarit Bhargava)"

    * tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits)
    i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c
    dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    tools: cpupower: fix return checks for sysfs_get_idlestate_count()
    drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME
    MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    leds: leds-gpio: Fix multiple instances registration without 'label' property
    iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    hwrandom / exynos / PM: Use CONFIG_PM in #ifdef
    block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    USB / PM: Drop CONFIG_PM_RUNTIME from the USB core
    PM: Merge the SET*_RUNTIME_PM_OPS() macros
    ...

    Linus Torvalds
     

10 Dec, 2014

1 commit

  • blk-mq users are allowed to free the memory request_queue.tag_set
    points at after blk_cleanup_queue() has finished but before
    blk_release_queue() has started. This can happen e.g. in the SCSI
    core. The SCSI core namely embeds the tag_set structure in a SCSI
    host structure. The SCSI host structure is freed by
    scsi_host_dev_release(). This function is called after
    blk_cleanup_queue() finished but can be called before
    blk_release_queue().

    This means that it is not safe to access request_queue.tag_set from
    inside blk_release_queue(). Hence remove the blk_sync_queue() call
    from blk_release_queue(). This call is not necessary - outstanding
    requests must have finished before blk_release_queue() is
    called. Additionally, move the blk_mq_free_queue() call from
    blk_release_queue() to blk_cleanup_queue() to avoid that struct
    request_queue.tag_set gets accessed after it has been freed.

    This patch avoids that the following kernel oops can be triggered
    when deleting a SCSI host for which scsi-mq was enabled:

    Call Trace:
    [] lock_acquire+0xc4/0x270
    [] mutex_lock_nested+0x61/0x380
    [] blk_mq_free_queue+0x30/0x180
    [] blk_release_queue+0x84/0xd0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] blk_put_queue+0x15/0x20
    [] disk_release+0x99/0xd0
    [] device_release+0x36/0xb0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] put_disk+0x1a/0x20
    [] __blkdev_put+0x135/0x1b0
    [] blkdev_put+0x50/0x160
    [] kill_block_super+0x44/0x70
    [] deactivate_locked_super+0x44/0x60
    [] deactivate_super+0x4e/0x70
    [] cleanup_mnt+0x43/0x90
    [] __cleanup_mnt+0x12/0x20
    [] task_work_run+0xac/0xe0
    [] do_notify_resume+0x61/0xa0
    [] int_signal+0x12/0x17

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

04 Dec, 2014

1 commit

  • After commit b2b49ccbdd54 (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
    selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks
    depending on CONFIG_PM_RUNTIME may now be changed to depend on
    CONFIG_PM.

    Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core.

    Reviewed-by: Aaron Lu
    Acked-by: Jens Axboe
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
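
The kind of change described in the commit above can be sketched in plain C. This is only an illustration: `toy_queue` and `toy_pm_runtime_init()` are hypothetical names, not the block core's real code, and `CONFIG_PM` is defined by hand here, whereas in a kernel build it would come from Kconfig.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: in a real build CONFIG_PM comes from Kconfig; it is defined by hand here. */
#define CONFIG_PM 1

/* Hypothetical stand-in for a queue that carries runtime PM state. */
struct toy_queue {
    int rpm_enabled;
};

#ifdef CONFIG_PM /* before the change this guard would have read CONFIG_PM_RUNTIME */
static void toy_pm_runtime_init(struct toy_queue *q)
{
    assert(q != NULL);
    q->rpm_enabled = 1;
}
#else
/* Stub so that callers need no #ifdef of their own when PM is compiled out. */
static void toy_pm_runtime_init(struct toy_queue *q)
{
    q->rpm_enabled = 0;
}
#endif
```

Only the preprocessor guard changes; the guarded code stays the same, which is why the replacement can be done mechanically across the tree.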
     

12 Nov, 2014

1 commit

  • Currently SCSI piggybacks on the block layer to define the concept
    of a tagged command. But we want to be able to have block-level host-wide
    tags assigned even for untagged commands like the initial INQUIRY, so add
    a new SCSI-level flag for commands that are tagged at the scsi level, so
    that even commands without that set can have tags assigned to them. Note
    that this already is the case for the blk-mq code path, and this just lets
    the old path catch up with it.

    We also set this flag based upon sdev->simple_tags instead of the block
    queue flag, so that it is entirely independent of the block layer tagging,
    and thus always correct even if a driver doesn't use block level tagging
    yet.

    Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and
    removing it forces them to look for the proper replacement.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Christie
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke

    Christoph Hellwig
     

19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per hardware context flush request, which both cleans up the
    code and should scale better for flush intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
     

13 Oct, 2014

2 commits

  • In __get_request calls to printk_ratelimited, include the function name so
    the callbacks suppressed message matches the messages that are printed,
    and add "dev" before the device name so it matches other block layer
    messages.

    Signed-off-by: Robert Elliott
    Reviewed-by: Webb Scales
    Signed-off-by: Jens Axboe

    Robert Elliott
     
  • In blk_update_request, change the printk_ratelimited
    prefix from end_request to blk_update_request so it
    matches the name printed if rate limiting occurs.

    Old:
    [10234.933106] blk_update_request: 174 callbacks suppressed
    [10234.934940] end_request: critical target error, dev sdr, sector 16
    [10234.949788] end_request: critical target error, dev sdr, sector 16

    New:
    [16863.445173] blk_update_request: 398 callbacks suppressed
    [16863.447029] blk_update_request: critical target error, dev sdr, sector
    1442066176
    [16863.449383] blk_update_request: critical target error, dev sdr, sector
    802802888
    [16863.451680] blk_update_request: critical target error, dev sdr, sector
    1609535456

    Signed-off-by: Robert Elliott
    Reviewed-by: Webb Scales
    Signed-off-by: Jens Axboe

    Robert Elliott
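
The two commits above make the message prefix match the function name that the rate limiter reports. A minimal userspace sketch of that idea follows. It assumes the prefix is derived from `__func__`, which is a choice made for this example rather than something the commit messages state, and `toy_report_error` and `toy_update_request` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: formats "<function>: <description>, dev <name>, sector <n>".
 * __func__ expands inside the calling function, so the prefix always names the caller.
 */
#define toy_report_error(buf, len, desc, dev, sector)          \
    snprintf((buf), (len), "%s: %s, dev %s, sector %llu",      \
             __func__, (desc), (dev), (unsigned long long)(sector))

/* Stand-in for the function whose errors are being reported. */
static int toy_update_request(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return toy_report_error(buf, len, "critical target error", "sdr", 16ULL);
}
```

Because the prefix is taken from the enclosing function, the printed line and any "callbacks suppressed" line that names the same function cannot drift apart when a function is renamed.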
     

08 Oct, 2014

1 commit

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     

04 Oct, 2014

1 commit

  • Request cloning clones bios in the request to track the completion
    of each bio.
    For that purpose, we can use bio_clone_fast() instead of bio_clone()
    to avoid unnecessary allocation and copy of bvecs.

    This patch reduces memory footprint of request-based device-mapper
    (about 1-4KB for each request) and is a preparation for further
    reduction of memory usage by removing unused bvec mempool.

    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Junichi Nomura
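
The saving described in the commit above comes from sharing the vector array instead of copying it. The sketch below models that difference in userspace C. All `toy_*` names are hypothetical, and the structures are heavily simplified stand-ins rather than the kernel's `struct bio`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for one segment descriptor. */
struct toy_vec {
    unsigned int len;
    unsigned int offset;
};

/* Simplified stand-in for a bio: a vector array plus an ownership flag. */
struct toy_bio {
    struct toy_vec *vecs;
    unsigned int nr_vecs;
    int owns_vecs; /* 1 if this bio allocated vecs and must free them */
};

/* Full clone: allocates and copies the vector array. Returns 0 on success. */
static int toy_clone_deep(struct toy_bio *dst, const struct toy_bio *src)
{
    assert(src->nr_vecs > 0);
    dst->vecs = malloc(src->nr_vecs * sizeof(*dst->vecs));
    if (dst->vecs == NULL)
        return -1;
    memcpy(dst->vecs, src->vecs, src->nr_vecs * sizeof(*dst->vecs));
    dst->nr_vecs = src->nr_vecs;
    dst->owns_vecs = 1;
    return 0;
}

/* Fast clone: shares the source's vector array, so nothing is allocated or copied. */
static void toy_clone_fast(struct toy_bio *dst, const struct toy_bio *src)
{
    dst->vecs = src->vecs;
    dst->nr_vecs = src->nr_vecs;
    dst->owns_vecs = 0;
}

/* Releases a clone; only a deep clone frees the array it owns. */
static void toy_bio_release(struct toy_bio *bio)
{
    if (bio->owns_vecs)
        free(bio->vecs);
    bio->vecs = NULL;
    bio->nr_vecs = 0;
}
```

The trade-off in this model is that a fast clone is only valid while the source's array stays alive and unmodified, which fits the request-cloning case where the clone exists only to track completion.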
     

01 Oct, 2014

1 commit


26 Sep, 2014

6 commits

  • This patch adds support for running a separate flush machinery for
    each blk-mq dispatch queue, so that:

    - current init_request and exit_request callbacks can
    cover flush request too, then the buggy copying way of
    initializing flush request's pdu can be fixed

    - flushing performance gets improved in case of multi hw-queue

    In a fio sync write test over virtio-blk (4 hw queues, ioengine=sync,
    iodepth=64, numjobs=4, bs=4K), throughput increased considerably in
    my test environment:
    - throughput: +70% in case of virtio-blk over null_blk
    - throughput: +30% in case of virtio-blk over SSD image

    The multi virtqueue feature isn't merged into QEMU yet, and patches for
    the feature can be found in the tree below:

    git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4

    Simply passing 'num_queues=4 vectors=5' should be enough to
    enable the multi-queue (quad-queue) feature for QEMU virtio-blk.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(),
    so that this function can find the corresponding blk_flush_queue
    bound with current mq context since the flush queue will become
    per hw-queue.

    For legacy queue, the parameter can be simply 'NULL'.

    For the multiqueue case, the parameter should be set to the context
    from which the related request originated. With this context
    info, the hw queue and related flush queue can be found easily.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The two helpers have now served their purpose, so just call
    blk_alloc_flush_queue() and blk_free_flush_queue() directly.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch introduces 'struct blk_flush_queue' and puts all
    flush machinery related fields into this structure, so that

    - flush implementation details aren't exposed to driver
    - it is easy to convert to per dispatch-queue flush machinery

    This patch is basically a mechanical replacement.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • These fields are always used with the flush request, so
    initialize them together.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • These two temporary functions are introduced to hold the flush
    initialization and de-initialization, so that we can
    introduce the 'flush queue' more easily in the following patch. Once
    the 'flush queue' and its allocation/free functions are ready,
    they will be removed for the sake of code readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
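
The lookup that the blk_get_flush_queue() commit above describes, a NULL context for a legacy queue and a per hw-queue flush queue otherwise, can be sketched as a toy model. Everything here is an assumption made for illustration: the `toy_*` types, the fixed number of hardware queues, and the modulo CPU mapping are not the kernel's actual structures or mapping function.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_HW_QUEUES 2

struct toy_flush_queue { int pending; };

/* Each hardware queue owns its own flush queue. */
struct toy_hw_queue { struct toy_flush_queue fq; };

/* Software context: records the CPU the request was submitted from. */
struct toy_ctx { unsigned int cpu; };

struct toy_queue {
    struct toy_flush_queue legacy_fq;          /* used when no context is given */
    struct toy_hw_queue hw[TOY_NR_HW_QUEUES];
};

/* Hypothetical CPU-to-hardware-queue mapping: a simple modulo. */
static struct toy_hw_queue *toy_map_queue(struct toy_queue *q, unsigned int cpu)
{
    return &q->hw[cpu % TOY_NR_HW_QUEUES];
}

/* NULL context selects the single legacy flush queue; otherwise map ctx -> hw queue -> fq. */
static struct toy_flush_queue *toy_get_flush_queue(struct toy_queue *q,
                                                   const struct toy_ctx *ctx)
{
    assert(q != NULL);
    if (ctx == NULL)
        return &q->legacy_fq;
    return &toy_map_queue(q, ctx->cpu)->fq;
}
```

In this model, requests submitted from CPUs that map to different hardware queues end up on different flush queues, which is the property the per hw-queue design relies on.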
     

11 Sep, 2014

1 commit


09 Sep, 2014

2 commits

  • This patch fixes spelling typos found in DocBook/kernel-api.xml.
    Because the file is generated from the source comments,
    the comments in the source code have to be fixed.

    Signed-off-by: Masanari Iida
    Acked-by: Randy Dunlap
    Signed-off-by: Jiri Kosina

    Masanari Iida
     
  • bdev_get_queue() returns the request_queue associated with the
    specified block_device. blk_get_backing_dev_info() makes use of
    bdev_get_queue() to determine the associated bdi given a block_device.

    All the callers of bdev_get_queue() including
    blk_get_backing_dev_info() assume that bdev_get_queue() may return
    NULL and implement NULL handling; however, bdev_get_queue() requires
    the passed in block_device is opened and attached to its gendisk.
    Because an active gendisk always has a valid request_queue associated
    with it, bdev_get_queue() can never return NULL and neither can
    blk_get_backing_dev_info().

    Make it clear that neither of the two functions can return NULL and
    remove NULL handling from all the callers.

    Signed-off-by: Tejun Heo
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Jens Axboe

    Tejun Heo
     

29 Aug, 2014

1 commit

  • The blk_get_request function may fail in low-memory conditions or during
    device removal (even if __GFP_WAIT is set). To distinguish between these
    errors, modify the blk_get_request call stack to return the appropriate
    ERR_PTR. Verify that all callers check the return status and consider
    IS_ERR instead of a simple NULL pointer check.

    For consistency, make a similar change to the blk_mq_alloc_request leg
    of blk_get_request. It may fail if the queue is dead, or the caller was
    unwilling to wait.

    Signed-off-by: Joe Lawrence
    Acked-by: Jiri Kosina [for pktdvd]
    Acked-by: Boaz Harrosh [for osd]
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Joe Lawrence
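
The error-pointer convention that the commit above moves blk_get_request() to can be illustrated in userspace. The sketch below reimplements the idiom under `toy_*` names instead of using the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers, and the specific errno values chosen for each failure (-ENODEV for a dying queue, -ENOMEM for an allocation failure) are assumptions for the example, not taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Small negative errno values are encoded in the top 4095 addresses. */
#define TOY_MAX_ERRNO 4095

static void *toy_err_ptr(long error)
{
    assert(error < 0 && error >= -TOY_MAX_ERRNO);
    return (void *)(intptr_t)error;
}

static long toy_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static bool toy_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_request { int tag; };

enum toy_queue_state { TOY_QUEUE_OK, TOY_QUEUE_DYING, TOY_QUEUE_NO_MEMORY };

static struct toy_request toy_rq_storage;

/* Hypothetical allocator: returns a request or an encoded error, never NULL. */
static struct toy_request *toy_get_request(enum toy_queue_state state)
{
    switch (state) {
    case TOY_QUEUE_DYING:
        return toy_err_ptr(-ENODEV);   /* assumed code for device removal */
    case TOY_QUEUE_NO_MEMORY:
        return toy_err_ptr(-ENOMEM);   /* assumed code for low memory */
    default:
        toy_rq_storage.tag = 1;
        return &toy_rq_storage;
    }
}

/* Caller pattern: test with the is-err helper instead of comparing against NULL. */
static long toy_submit(enum toy_queue_state state)
{
    struct toy_request *rq = toy_get_request(state);

    if (toy_is_err(rq))
        return toy_ptr_err(rq);
    return 0;
}
```

A caller that still checks only for NULL would treat an encoded error as a valid request, which is why the commit asks that every caller be verified.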
     

23 Aug, 2014

1 commit

  • This patch fixes code such as the following with scsi-mq enabled:

    rq = blk_get_request(...);
    blk_rq_set_block_pc(rq);

    rq->cmd = my_cmd_buffer; /* separate CDB buffer */

    blk_execute_rq_nowait(...);

    Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for
    large CDBs only). Without this patch, scsi_mq_prep_fn() will set
    rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device.

    Signed-off-by: Tony Battersby
    Signed-off-by: Jens Axboe

    Tony Battersby
     

02 Jul, 2014

2 commits

  • blk_mq freezing is entangled with generic bypassing which bypasses
    blkcg and io scheduler and lets IO requests fall through the block
    layer to the drivers in FIFO order. This allows forward progress on
    IOs with the advanced features disabled so that those features can be
    configured or altered without worrying about stalling IO which may
    lead to deadlock through memory allocation.

    However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
    doesn't make use of blkcg or ioscheds and it maps bypassing to
    freezing, which blocks request processing and drains all the in-flight
    ones. This causes problems as bypassing assumes that request
    processing is online. blk-mq works around this by conditionally
    allowing request processing for the problem case - during queue
    initialization.

    Another oddity is that except for during queue cleanup, bypassing
    started on the generic side prevents blk-mq from processing new
    requests but doesn't drain the in-flight ones. This shouldn't break
    anything but again highlights that something isn't quite right here.

    The root cause is conflating blk-mq freezing and generic bypassing
    which are two different mechanisms. The only intersecting purpose
    that they serve is during queue cleanup. Let's properly separate
    blk-mq freezing from generic bypassing and simply use it where
    necessary.

    * request_queue->mq_freeze_depth is added and
    blk_mq_[un]freeze_queue() now operate on this counter instead of
    ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
    but the counter is tested directly. This will be further updated by
    later changes.

    * blk_mq_drain_queue() is dropped and "__" prefix is dropped from
    blk_mq_freeze_queue(). Queue cleanup path now calls
    blk_mq_freeze_queue() directly.

    * blk_queue_enter()'s fast path condition is simplified to simply
    check @q->mq_freeze_depth. Previously, the condition was

    !blk_queue_dying(q) &&
    (!blk_queue_bypass(q) || !blk_queue_init_done(q))

    mq_freeze_depth is incremented right after dying is set and
    blk_queue_init_done() exception isn't necessary as blk-mq doesn't
    start frozen, which only leaves the blk_queue_bypass() test which
    can be replaced by @q->mq_freeze_depth test.

    This change simplifies the code and reduces confusion in the area.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
    skip queue draining if bypass_depth was already above zero. The
    assumption is that the one which bumped the bypass_depth should have
    performed draining already; however, there's nothing which prevents a
    new instance of bypassing/freezing from starting before the previous
    one finishes draining. The current code may allow the later
    bypassing/freezing instances to complete while there still are
    in-flight requests which haven't finished draining.

    Fix it by draining regardless of bypass_depth. We still skip draining
    from blk_queue_bypass_start() while the queue is initializing to avoid
    introducing excessive delays during boot. INIT_DONE setting is moved
    above the initial blk_queue_bypass_end() so that bypassing attempts
    can't slip in between.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
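
The difference between the old and new draining rules in this commit can be sketched with a single-threaded model. It cannot reproduce the race itself; it only shows which callers end up draining. All names here (`struct bypass_model`, `model_drain`, `bypass_start_old`, `bypass_start_new`) are hypothetical, and draining is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: draining is represented by a call counter. */
struct bypass_model {
    int bypass_depth;  /* stands in for q->bypass_depth */
    bool init_done;    /* stands in for the INIT_DONE queue flag */
    int drain_calls;   /* how many times draining was performed */
};

static void model_drain(struct bypass_model *q)
{
    q->drain_calls++;
}

/* Old rule per the commit message: skip draining if depth was already > 0. */
static void bypass_start_old(struct bypass_model *q)
{
    assert(q);
    if (q->bypass_depth++ == 0)
        model_drain(q);
}

/* New rule: always drain, except while the queue is still initializing. */
static void bypass_start_new(struct bypass_model *q)
{
    assert(q);
    q->bypass_depth++;
    if (q->init_done)
        model_drain(q);
}
```

With two nested starts on an initialized queue, the old rule drains once while the new rule drains on both calls, so the second caller no longer returns on the assumption that someone else has finished draining.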
     

20 Jun, 2014

1 commit

  • Pull block fixes from Jens Axboe:
    "A smaller collection of fixes for the block core that would be nice to
    have in -rc2. This pull request contains:

    - Fixes for races in the wait/wakeup logic used in blk-mq from
    Alexander. No issues have been observed, but it is definitely a
    bit flaky currently. Alternatively, we may drop the cyclic
    wakeups going forward, but that needs more testing.

    - Some cleanups from Christoph.

    - Fix for an oops in null_blk if queue_mode=1 and softirq completions
    are used. From me.

    - A fix for a regression caused by the chunk size setting. It
    inadvertently used max_hw_sectors instead of max_sectors, which is
    incorrect, and causes hangs on btrfs multi-disk setups (where hw
    sectors apparently isn't set). From me.

    - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a
    recent addition as well, but it actually breaks blk-mq which relies
    on strict scheduling. If the workqueue power_efficient mode is
    turned on, this breaks blk-mq. From Matias.

    - null_blk module parameter description fix from Mike"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: bitmap tag: fix races in bt_get() function
    blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
    blk-mq: bitmap tag: fix races on shared ::wake_index fields
    block: blk_max_size_offset() should check ->max_sectors
    null_blk: fix softirq completions for queue_mode == 1
    blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
    blk-mq: properly drain stopped queues
    block: remove WQ_POWER_EFFICIENT from kblockd
    null_blk: fix name and description of 'queue_mode' module parameter
    block: remove elv_abort_queue and blk_abort_flushes

    Linus Torvalds
     

16 Jun, 2014

1 commit

  • Pull NVMe update from Matthew Wilcox:
    "Mostly bugfixes again for the NVMe driver. I'd like to call out the
    exported tracepoint in the block layer; I believe Keith has cleared
    this with Jens.

    We've had a few reports from people who're really pounding on NVMe
    devices at scale, hence the timeout changes (and new module
    parameters), hotplug cpu deadlock, tracepoints, and minor performance
    tweaks"

    [ Jens hadn't seen that tracepoint thing, but is ok with it - it will
    end up going away when mq conversion happens ]

    * git://git.infradead.org/users/willy/linux-nvme: (22 commits)
    NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
    NVMe: Use Log Page constants in SCSI emulation
    NVMe: Define Log Page constants
    NVMe: Fix hot cpu notification dead lock
    NVMe: Rename io_timeout to nvme_io_timeout
    NVMe: Use last bytes of f/w rev SCSI Inquiry
    NVMe: Adhere to request queue block accounting enable/disable
    NVMe: Fix nvme get/put queue semantics
    NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
    NVMe: Make admin timeout a module parameter
    NVMe: Make iod bio timeout a parameter
    NVMe: Prevent possible NULL pointer dereference
    NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
    NVMe: Update data structures for NVMe 1.2
    NVMe: Enable BUILD_BUG_ON checks
    NVMe: Update namespace and controller identify structures to the 1.1a spec
    NVMe: Flush with data support
    NVMe: Configure support for block flush
    NVMe: Add tracepoints
    NVMe: Protect against badly formatted CQEs
    ...

    Linus Torvalds
     

12 Jun, 2014

1 commit

  • blk-mq issues async requests through kblockd. To issue a work request on
    a specific CPU, kblockd_schedule_delayed_work_on is used. However, the
    specific CPU choice may not be honored, if the power_efficient option
    for workqueues is set. blk-mq requires that we have strict per-cpu
    scheduling, so it won't work properly if kblockd is marked
    POWER_EFFICIENT and power_efficient is set.

    Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior.
    This essentially reverts part of commit 695588f9454b, which added
    the WQ_POWER_EFFICIENT marker to kblockd.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     

06 Jun, 2014

1 commit

  • With the optimizations around not clearing the full request at alloc
    time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
    up to the user allocating the request.

    Add a blk_rq_set_block_pc() that sets the command type to
    REQ_TYPE_BLOCK_PC, and properly initializes the members associated
    with this type of request. Update callers to use this function instead
    of manipulating rq->cmd_type directly.

    Includes fixes from Christoph Hellwig for my half-assed
    attempt.

    Signed-off-by: Jens Axboe

    Jens Axboe
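
The idea behind a helper like blk_rq_set_block_pc() can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel implementation: `struct request_model`, its fields, `MODEL_MAX_CDB` and `model_rq_set_block_pc()` are hypothetical, and the real request structure has more members to initialize. The point is that one function sets the type and every field that depends on it, so callers do not have to remember each one after an allocation that skips full clearing.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAX_CDB 16  /* hypothetical command buffer size */

enum model_cmd_type {
    MODEL_REQ_TYPE_FS = 1,
    MODEL_REQ_TYPE_BLOCK_PC = 2,
};

/* Hypothetical, heavily reduced stand-in for a block request. */
struct request_model {
    enum model_cmd_type cmd_type;
    unsigned int data_len;
    unsigned char cmd_buf[MODEL_MAX_CDB];
    unsigned char *cmd;
};

/* Set the type and initialize every member tied to that type. */
static void model_rq_set_block_pc(struct request_model *rq)
{
    assert(rq);
    rq->cmd_type = MODEL_REQ_TYPE_BLOCK_PC;
    rq->data_len = 0;
    memset(rq->cmd_buf, 0, sizeof(rq->cmd_buf));
    rq->cmd = rq->cmd_buf;
}
```

A caller that previously assigned the command type by hand would call the helper instead, and so gets a zeroed command buffer and length even when the request memory was not cleared at allocation time.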
     

29 May, 2014

1 commit


28 May, 2014

1 commit


27 May, 2014

1 commit


21 May, 2014

2 commits

  • In blk_mq_make_request(), do the blk_queue_nomerges() check
    outside the call to blk_attempt_plug_merge() to eliminate
    function call overhead when nomerges=2 (disabled).

    Signed-off-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Robert Elliott
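
The effect of hoisting the nomerges check out of the merge helper can be sketched as below. This is a userspace illustration only: `struct merge_model`, `model_attempt_plug_merge()` and `try_plug_merge()` are hypothetical, and the merge helper always reports "nothing merged". The short-circuit `&&` means the helper is never entered when merging is disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state relevant to the check. */
struct merge_model {
    bool nomerges;    /* stands in for blk_queue_nomerges(q) */
    int merge_calls;  /* how often the merge helper was entered */
};

/* Stand-in for the merge attempt; the model never finds a merge. */
static bool model_attempt_plug_merge(struct merge_model *q)
{
    q->merge_calls++;
    return false;
}

/* Caller-side check: test the flag first, call the helper only if needed. */
static bool try_plug_merge(struct merge_model *q)
{
    assert(q);
    return !q->nomerges && model_attempt_plug_merge(q);
}
```

With merging disabled, the call counter stays at zero, which is the function call overhead the commit removes.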
     
  • For request_fn based devices, the block layer exports a 'nr_requests'
    file through sysfs to allow adjusting of queue depth on the fly.
    Currently this returns -EINVAL for blk-mq, since it's not wired up.
    Wire this up for blk-mq, so that it now also allows dynamic
    adjustments of the allowed queue depth for any given block device
    managed by blk-mq.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 May, 2014

1 commit

  • We first check if we have inflight IO, then retrieve that
    same number again. Usually this isn't that costly since the
    chance of having the data dirtied in between is small, but
    there's no reason for calling part_in_flight() twice.

    Signed-off-by: Jens Axboe

    Jens Axboe
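
The change described here, reading the in-flight count once and reusing it, can be sketched with a small model. The names are hypothetical (`struct part_model`, `model_part_in_flight()`, `round_stats()`), and the accounting fields are simplified stand-ins for the per-partition statistics.

```c
#include <assert.h>

/* Hypothetical, simplified per-partition accounting state. */
struct part_model {
    unsigned long in_flight;      /* requests currently in flight */
    unsigned long stamp;          /* time of the last accounting update */
    unsigned long time_in_queue;  /* accumulated in-flight time */
    unsigned long io_ticks;       /* accumulated busy time */
    int in_flight_reads;          /* how often the count was fetched */
};

static unsigned long model_part_in_flight(struct part_model *p)
{
    p->in_flight_reads++;
    return p->in_flight;
}

/* Fetch the in-flight count once, then use the cached value. */
static void round_stats(struct part_model *p, unsigned long now)
{
    unsigned long inflight;

    assert(p);
    if (now == p->stamp)
        return;

    inflight = model_part_in_flight(p);
    if (inflight) {
        p->time_in_queue += inflight * (now - p->stamp);
        p->io_ticks += now - p->stamp;
    }
    p->stamp = now;
}
```

Besides saving a call, caching the value means the test and the multiplication use the same number even if the counter changes in between.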
     

05 May, 2014

1 commit

  • Adding tracepoints for bio_complete and block_split into nvme to help
    with gathering IO info using blktrace and blkparse.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     

17 Apr, 2014

2 commits


16 Apr, 2014

2 commits

  • This was used in the olden days, back when onions were proper
    yellow. Basically it mapped to the current buffer to be
    transferred. With highmem being added more than a decade ago,
    most drivers map pages out of a bio, and rq->buffer isn't
    pointing at anything valid.

    Convert old style drivers to just use bio_data().

    For the discard payload use case, just reference the page
    in the bio.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't like this, but things have diverged with the blk-mq fixes
    in 3.15-rc1. So merge it in.

    Jens Axboe
     

11 Apr, 2014

1 commit