15 Dec, 2014

1 commit

  • This reverts commit 52f7eb945f2ba62b324bb9ae16d945326a961dcf.

    The optimization is only really safe for a single queue, otherwise
    'bs' and 'bt' can indeed change, and if we don't do a finish_wait()
    for each loop, we'll potentially change the wait structure and
    corrupt the task wait list.

    Reported-by: Jan Kara

    Jens Axboe
     

14 Dec, 2014

1 commit

  • Pull block driver core update from Jens Axboe:
    "This is the pull request for the core block IO changes for 3.19. Not
    a huge round this time, mostly lots of little good fixes:

    - Fix a bug in sysfs blktrace interface causing a NULL pointer
    dereference, when enabled/disabled through that API. From Arianna
    Avanzini.

    - Various updates/fixes/improvements for blk-mq:

    - A set of updates from Bart, mostly fixing bugs in the tag
    handling.

    - Cleanup/code consolidation from Christoph.

    - Extend queue_rq API to be able to handle batching issues of IO
    requests. NVMe will utilize this shortly. From me.

    - A few tag and request handling updates from me.

    - Cleanup of the preempt handling for running queues from Paolo.

    - Prevent running of unmapped hardware queues from Ming Lei.

    - Move the kdump memory limiting check to be in the correct
    location, from Shaohua.

    - Initialize all software queues at init time from Takashi. This
    prevents a kobject warning when CPUs are brought online that
    weren't online when a queue was registered.

    - Single writeback fix for I_DIRTY clearing from Tejun. Queued with
    the core IO changes, since it's just a single fix.

    - Version X of the __bio_add_page() segment addition retry from
    Maurizio. Hope the Xth time is the charm.

    - Documentation fixup for IO scheduler merging from Jan.

    - Introduce (and use) generic IO stat accounting helpers for non-rq
    drivers, from Gu Zheng.

    - Kill off artificial limiting of max sectors in a request from
    Christoph"

    * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    blk-mq: Fix uninitialized kobject at CPU hotplugging
    blktrace: don't let the sysfs interface remove trace from running list
    blk-mq: Use all available hardware queues
    blk-mq: Micro-optimize bt_get()
    blk-mq: Fix a race between bt_clear_tag() and bt_get()
    blk-mq: Avoid that __bt_get_word() wraps multiple times
    blk-mq: Fix a use-after-free
    blk-mq: prevent unmapped hw queue from being scheduled
    blk-mq: re-check for available tags after running the hardware queue
    blk-mq: fix hang in bt_get()
    blk-mq: move the kdump check to blk_mq_alloc_tag_set
    blk-mq: cleanup tag free handling
    blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq cpu map
    blk: introduce generic io stat accounting help function
    blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
    genhd: check for int overflow in disk_expand_part_tbl()
    blk-mq: add blk_mq_free_hctx_request()
    blk-mq: export blk_mq_free_request()
    blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
    ...

    Linus Torvalds
     

12 Dec, 2014

1 commit

  • The original behaviour is to refuse to add a new page if the maximum
    number of segments has been reached, regardless of whether the page we
    are going to add can be merged into the last segment.

    Unfortunately, when the system runs under heavy memory fragmentation
    conditions, a driver may try to add multiple pages to the last segment.
    The original code won't accept them and EBUSY will be reported to
    userspace.

    This patch modifies the function so it refuses to add a page only in case
    the latter starts a new segment and the maximum number of segments has
    already been reached.
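
    The before/after acceptance rule can be modelled in user space. The
    sketch below is only an illustration in Python, not the kernel code:
    can_add_page_old(), can_add_page_new() and the 'mergeable' flag are
    hypothetical names, and the real __bio_add_page() checks far more than
    this single condition.

```python
def can_add_page_old(nr_segments, max_segments, mergeable):
    """Original rule: refuse as soon as the segment limit is reached,
    whether or not the page could be merged into the last segment."""
    return nr_segments < max_segments


def can_add_page_new(nr_segments, max_segments, mergeable):
    """Patched rule: refuse only if the page would start a new segment
    while the segment limit has already been reached."""
    starts_new_segment = not mergeable
    return not (starts_new_segment and nr_segments >= max_segments)
```

    With 16 of 16 segments in use, the old rule rejects a mergeable page
    while the new rule accepts it; a page that needs a 17th segment is
    rejected by both.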

    The bug can be easily reproduced with the st driver:

    1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
    2) modprobe st buffer_kbs=1024
    3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
    dd: error writing `/dev/st0': Device or resource busy

    Signed-off-by: Maurizio Lombardi
    Signed-off-by: Ming Lei
    Cc: Jet Chen
    Cc: Tomas Henzl
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Maurizio Lombardi
     

11 Dec, 2014

1 commit

  • Pull ACPI and power management updates from Rafael Wysocki:
    "This time we have some more new material than we used to have during
    the last couple of development cycles.

    The most important part of it to me is the introduction of a unified
    interface for accessing device properties provided by platform
    firmware. It works with Device Trees and ACPI in a uniform way and
    drivers using it need not worry about where the properties come from
    as long as the platform firmware (either DT or ACPI) makes them
    available. It covers both devices and "bare" device node objects
    without struct device representation as that turns out to be necessary
    in some cases. This has been in the works for quite a few months (and
    development cycles) and has been approved by all of the relevant
    maintainers.

    On top of that, some drivers are switched over to the new interface
    (at25, leds-gpio, gpio_keys_polled) and some additional changes are
    made to the core GPIO subsystem to allow device drivers to manipulate
    GPIOs in the "canonical" way on platforms that provide GPIO
    information in their ACPI tables, but don't assign names to GPIO lines
    (in which case the driver needs to do that on the basis of what it
    knows about the device in question). That also has been approved by
    the GPIO core maintainers and the rfkill driver is now going to use
    it.

    Second is support for hardware P-states in the intel_pstate driver.
    It uses CPUID to detect whether or not the feature is supported by the
    processor in which case it will be enabled by default. However, it
    can be disabled entirely from the kernel command line if necessary.

    Next is support for a platform firmware interface based on ACPI
    operation regions used by the PMIC (Power Management Integrated
    Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms.
    That interface is used for manipulating power resources and for
    thermal management: sensor temperature reporting, trip point setting
    and so on.

    Also the ACPI core is now going to support the _DEP configuration
    information in a limited way. Basically, _DEP is supposed to reflect
    off-the-hierarchy dependencies between devices which may be very
    indirect, like when AML for one device accesses locations in an
    operation region handled by another device's driver (usually, the
    device depended on this way is a serial bus or GPIO controller). The
    support added this time is sufficient to make the ACPI battery driver
    work on Asus T100A, but it is general enough to be able to cover some
    other use cases in the future.

    Finally, we have a new cpufreq driver for the Loongson1B processor.

    In addition to the above, there are fixes and cleanups all over the
    place as usual and a traditional ACPICA update to a recent upstream
    release.

    As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for
    Intel platforms should be able to handle power management of the DMA
    engine correctly, the cpufreq-dt driver should interact with the
    thermal subsystem in a better way and the ACPI backlight driver should
    handle some more corner cases, among other things.

    On top of the ACPICA update there are fixes for race conditions in the
    ACPICA's interrupt handling code which might lead to some random and
    strange looking failures on some systems.

    In the cleanups department the most visible part is the series of
    commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration
    option. That was triggered by a discussion regarding the generic
    power domains code during which we realized that trying to support
    certain combinations of PM config options was painful and not really
    worth it, because nobody would use them in production anyway. For
    this reason, we decided to make CONFIG_PM_SLEEP select
    CONFIG_PM_RUNTIME and that led to the conclusion that the latter
    became redundant and CONFIG_PM could be used instead of it. The
    material here makes that replacement in a major part of the tree, but
    there will be at least one more batch of that in the second part of
    the merge window.

    Specifics:

    - Support for retrieving device properties information from ACPI _DSD
    device configuration objects and a unified device properties
    interface for device drivers (and subsystems) on top of that. As
    stated above, this works with Device Trees and ACPI and allows
    device drivers to be written in a platform firmware (DT or ACPI)
    agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are
    now going to use this new interface and the GPIO subsystem is
    additionally modified to allow device drivers to assign names to
    GPIO resources returned by ACPI _CRS objects (in case _DSD is not
    present or does not provide the expected data). The changes in
    this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron
    Lu, and Darren Hart with some fixes from others (Fabio Estevam,
    Geert Uytterhoeven).

    - Support for Hardware Managed Performance States (HWP) as described
    in Volume 3, section 14.4, of the Intel SDM in the intel_pstate
    driver. CPUID is used to detect whether or not the feature is
    supported by the processor. If supported, it will be enabled
    automatically unless the intel_pstate=no_hwp switch is present in
    the kernel command line. From Dirk Brandewie.

    - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie).

    - Support for firmware interface based on ACPI operation regions used
    by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR
    platforms for power resource control and thermal management (Aaron
    Lu).

    - Limited support for retrieving off-the-hierarchy dependencies
    between devices from ACPI _DEP device configuration objects and
    deferred probing support for the ACPI battery driver based on the
    _DEP information to make that driver work on Asus T100A (Lan
    Tianyu).

    - New cpufreq driver for the Loongson1B processor (Kelvin Cheung).

    - ACPICA update to upstream revision 20141107 which only affects
    tools (Bob Moore).

    - Fixes for race conditions in the ACPICA's interrupt handling code
    and in the ACPI code related to system suspend and resume (Lv Zheng
    and Rafael J Wysocki).

    - ACPI core fix for an RCU-related issue in the ioremap() regions
    management code that slowed down significantly after CPUs had been
    allowed to enter idle states even if they'd had RCU callbacks
    queued and triggered some problems in a certain proprietary graphics
    driver (and elsewhere). The fix replaces synchronize_rcu() in that
    code with synchronize_rcu_expedited() which makes the issue go
    away. From Konstantin Khlebnikov.

    - ACPI LPSS (Low-Power Subsystem) driver fix to handle power
    management of the DMA engine included into the LPSS correctly. The
    problem is that the DMA engine doesn't have ACPI PM support of its
    own and it simply is turned off when the last LPSS device having
    ACPI PM support goes into D3cold. To work around that, the PM
    domain used by the ACPI LPSS driver is redesigned so at least one
    device with ACPI PM support will be on as long as the DMA engine is
    in use. From Andy Shevchenko.

    - ACPI backlight driver fix to avoid using it on "Win8-compatible"
    systems where it doesn't work and where it was used by default by
    mistake (Aaron Lu).

    - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki,
    Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin
    Chaugule (mostly related to the upcoming ARM64 support).

    - Intel RAPL (Running Average Power Limit) power capping driver fixes
    and improvements including new processor IDs (Jacob Pan).

    - Generic power domains modification to power up domains after
    attaching devices to them to meet the expectations of device
    drivers and bus types assuming devices to be accessible at probe
    time (Ulf Hansson).

    - Preliminary support for controlling device clocks from the generic
    power domains core code and modifications of the ARM/shmobile
    platform to use that feature (Ulf Hansson).

    - Assorted minor fixes and cleanups of the generic power domains core
    code (Ulf Hansson, Geert Uytterhoeven).

    - Assorted minor fixes and cleanups of the device clocks control code
    in the PM core (Geert Uytterhoeven, Grygorii Strashko).

    - Consolidation of device power management Kconfig options by making
    CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter
    which is now redundant (Rafael J Wysocki and Kevin Hilman). That
    is the first batch of the changes needed for this purpose.

    - Core device runtime power management support code cleanup related
    to the execution of callbacks (Andrzej Hajda).

    - cpuidle ARM support improvements (Lorenzo Pieralisi).

    - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a
    new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and
    Bartlomiej Zolnierkiewicz).

    - New cpufreq driver callback (->ready) to be executed when the
    cpufreq core is ready to use a given policy object and cpufreq-dt
    driver modification to use that callback for cooling device
    registration (Viresh Kumar).

    - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James
    Geboski, Tomeu Vizoso).

    - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate,
    cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao,
    Stefan Wahren, Petr Cvek).

    - OPP (Operating Performance Points) framework modification to allow
    OPPs to be removed too and update of a few cpufreq drivers
    (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added
    during initialization) on driver removal (Viresh Kumar).

    - Hibernation core fixes and cleanups (Tina Ruchandani and Markus
    Elfring).

    - PM Kconfig fix related to CPU power management (Pankaj Dubey).

    - cpupower tool fix (Prarit Bhargava)"

    * tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits)
    i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c
    dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    tools: cpupower: fix return checks for sysfs_get_idlestate_count()
    drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME
    MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    leds: leds-gpio: Fix multiple instances registration without 'label' property
    iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    hwrandom / exynos / PM: Use CONFIG_PM in #ifdef
    block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    USB / PM: Drop CONFIG_PM_RUNTIME from the USB core
    PM: Merge the SET*_RUNTIME_PM_OPS() macros
    ...

    Linus Torvalds
     

10 Dec, 2014

6 commits

  • When a CPU is hotplugged, the current blk-mq spews a warning like:

    kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong.
    CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
    0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8
    ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58
    ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007
    Call Trace:
    dump_trace+0x86/0x330
    show_stack_log_lvl+0x94/0x170
    show_stack+0x21/0x50
    dump_stack+0x41/0x51
    kobject_add+0xa0/0xb0
    blk_mq_register_hctx+0x91/0xb0
    blk_mq_sysfs_register+0x3e/0x60
    blk_mq_queue_reinit_notify+0xf8/0x190
    notifier_call_chain+0x4c/0x70
    cpu_notify+0x23/0x50
    _cpu_up+0x157/0x170
    cpu_up+0x89/0xb0
    cpu_subsys_online+0x35/0x80
    device_online+0x5d/0xa0
    online_store+0x75/0x80
    kernfs_fop_write+0xda/0x150
    vfs_write+0xb2/0x1f0
    SyS_write+0x42/0xb0
    system_call_fastpath+0x16/0x1b
    0x7f0132fb24e0

    This is indeed because of an uninitialized kobject for blk_mq_ctx.
    The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but it
    loops over hctx_for_each_ctx(), i.e. it initializes only the ctxs of
    online CPUs. Thus, when a CPU is hotplugged, the ctx for the newly
    onlined CPU is registered without initialization.

    This patch fixes the issue by initializing all the ctx kobjects
    belonging to each queue.

    Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794
    Cc:
    Signed-off-by: Takashi Iwai
    Signed-off-by: Jens Axboe

    Takashi Iwai
     
  • Suppose that a system has two CPU sockets, three cores per socket,
    that it does not support hyperthreading and that four hardware
    queues are provided by a block driver. With the current algorithm
    this will lead to the following assignment of CPU cores to hardware
    queues:

    HWQ 0: 0 1
    HWQ 1: 2 3
    HWQ 2: 4 5
    HWQ 3: (none)

    This patch changes the queue assignment into:

    HWQ 0: 0 1
    HWQ 1: 2
    HWQ 2: 3 4
    HWQ 3: 5

    In other words, this patch has the following three effects:
    - All four hardware queues are used instead of only three.
    - CPU cores are spread more evenly over hardware queues. For the
    above example the range of the number of CPU cores associated
    with a single HWQ is reduced from [0..2] to [1..2].
    - If the number of HWQs is a multiple of the number of CPU sockets
    it is now guaranteed that all CPU cores associated with a single
    HWQ reside on the same CPU socket.
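
    The two tables above can be reproduced with a small user-space model.
    This is a sketch in Python under stated assumptions: map_old() and
    map_new() are hypothetical names, and the two formulas were chosen
    because they reproduce the assignments shown above, not because they
    are quoted from the patch. The model ignores sockets, hyperthread
    siblings and offline CPUs.

```python
def map_old(nr_cpus, nr_queues):
    """Assumed old rule: fixed-size groups of ceil(nr_cpus / nr_queues)
    cores, which can leave the last queues without any core."""
    group = (nr_cpus + nr_queues - 1) // nr_queues
    return [cpu // group for cpu in range(nr_cpus)]


def map_new(nr_cpus, nr_queues):
    """Assumed new rule: scale the core number into the queue range so
    that every queue gets at least one core when nr_cpus >= nr_queues."""
    return [cpu * nr_queues // nr_cpus for cpu in range(nr_cpus)]


def by_queue(mapping, nr_queues):
    """Turn a per-core list of queue indices into per-queue core lists."""
    return [[cpu for cpu, q in enumerate(mapping) if q == hwq]
            for hwq in range(nr_queues)]
```

    For six cores and four queues, by_queue(map_old(6, 4), 4) leaves HWQ 3
    empty, while by_queue(map_new(6, 4), 4) uses all four queues.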

    Signed-off-by: Bart Van Assche
    Reviewed-by: Sagi Grimberg
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
    calls into a single call.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • What we need is the following two guarantees:
    * Any thread that observes the effect of the test_and_set_bit() by
    __bt_get_word() also observes the preceding addition of 'current'
    to the appropriate wait list. This is guaranteed by the semantics
    of the spin_unlock() operation performed by prepare_to_wait().
    Hence the conversion of test_and_set_bit_lock() into
    test_and_set_bit().
    * The wait lists are examined by bt_clear_tag() after the tag bit has
    been cleared. clear_bit_unlock() guarantees that any thread that
    observes that the bit has been cleared also observes the store
    operations preceding clear_bit_unlock(). However,
    clear_bit_unlock() does not prevent the wait lists from being
    examined before the tag bit is cleared. Hence the addition of a
    memory barrier between clear_bit() and the wait list examination.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • If __bt_get_word() is called with last_tag != 0, if the first
    find_next_zero_bit() fails, if after wrap-around the
    test_and_set_bit() call fails and find_next_zero_bit() succeeds,
    if the next test_and_set_bit() call fails and subsequently
    find_next_zero_bit() does not find a zero bit, then another
    wrap-around will occur. Avoid this by introducing an additional
    local variable.
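
    The idea of allowing only one wrap-around can be sketched in user
    space. The Python model below is a simplification, not the kernel
    function: get_tag(), find_next_zero(), claim() and the 'wrapped'
    variable are hypothetical names, and try_set() merely stands in for
    test_and_set_bit() so that a lost race can be simulated.

```python
def find_next_zero(bits, start):
    """Index of the first clear bit at or after start, or len(bits)."""
    for i in range(start, len(bits)):
        if not bits[i]:
            return i
    return len(bits)


def get_tag(bits, last_tag, try_set):
    """Claim a free tag, wrapping around to index 0 at most once.

    try_set(bits, tag) returns True if we claimed the bit and False if
    another thread got there first.
    """
    start = last_tag
    wrapped = (start == 0)  # starting at 0 already covers the whole map
    while True:
        tag = find_next_zero(bits, start)
        if tag >= len(bits):
            if wrapped:
                return -1   # the single allowed wrap is used up: give up
            wrapped = True  # remember that we have wrapped once
            start = 0
            continue
        if try_set(bits, tag):
            return tag
        start = tag + 1     # lost the race: keep scanning forward


def claim(bits, tag):
    """Uncontended stand-in for test_and_set_bit()."""
    if bits[tag]:
        return False
    bits[tag] = 1
    return True
```

    Because 'wrapped' is never reset, the search terminates after at most
    two passes over the map even if every attempt to claim a bit fails.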

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • blk-mq users are allowed to free the memory request_queue.tag_set
    points at after blk_cleanup_queue() has finished but before
    blk_release_queue() has started. This can happen e.g. in the SCSI
    core, which embeds the tag_set structure in a SCSI host
    structure. The SCSI host structure is freed by
    scsi_host_dev_release(). This function is called after
    blk_cleanup_queue() finished but can be called before
    blk_release_queue().

    This means that it is not safe to access request_queue.tag_set from
    inside blk_release_queue(). Hence remove the blk_sync_queue() call
    from blk_release_queue(). This call is not necessary - outstanding
    requests must have finished before blk_release_queue() is
    called. Additionally, move the blk_mq_free_queue() call from
    blk_release_queue() to blk_cleanup_queue() so that
    request_queue.tag_set is not accessed after it has been freed.

    This patch prevents the following kernel oops, which could be
    triggered when deleting a SCSI host for which scsi-mq was enabled:

    Call Trace:
    lock_acquire+0xc4/0x270
    mutex_lock_nested+0x61/0x380
    blk_mq_free_queue+0x30/0x180
    blk_release_queue+0x84/0xd0
    kobject_cleanup+0x7b/0x1a0
    kobject_put+0x30/0x70
    blk_put_queue+0x15/0x20
    disk_release+0x99/0xd0
    device_release+0x36/0xb0
    kobject_cleanup+0x7b/0x1a0
    kobject_put+0x30/0x70
    put_disk+0x1a/0x20
    __blkdev_put+0x135/0x1b0
    blkdev_put+0x50/0x160
    kill_block_super+0x44/0x70
    deactivate_locked_super+0x44/0x60
    deactivate_super+0x4e/0x70
    cleanup_mnt+0x43/0x90
    __cleanup_mnt+0x12/0x20
    task_work_run+0xac/0xe0
    do_notify_resume+0x61/0xa0
    int_signal+0x12/0x17

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Dec, 2014

3 commits

  • Pull SCSI updates from James Bottomley:
    "This patch is the usual mix of driver updates (srp, ipr, scsi_debug,
    NCR5380, fnic, 53c974, ses, wd719x, hpsa, megaraid_sas).

    Of those, wd719x is new and 53c974 is a rewrite of the old tmscsim
    driver and the extensive work by Finn Thain rewrites all the NCR5380
    based drivers.

    There's also extensive infrastructure updates: a new logging
    infrastructure for sense information and a rewrite of the tagged
    command queue API and an assortment of minor updates"

    * tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (183 commits)
    scsi: set fmt to NULL scsi_extd_sense_format() by default
    libsas: remove task_collector mode
    wd719x: remove dma_cache_sync call
    scsi_debug: add Report supported opcodes+tmfs; Compare and write
    scsi_debug: change SCSI command parser to table driven
    scsi_debug: add Capacity Changed Unit Attention
    scsi_debug: append inject error flags onto scsi_cmnd object
    scsi_debug: pinpoint invalid field in sense data
    wd719x: Add firmware documentation
    wd719x: Introduce Western Digital WD7193/7197/7296 PCI SCSI card driver
    eeprom-93cx6: Add (read-only) support for 8-bit mode
    esas2r: fix an oversight in setting return value
    esas2r: fix an error path in esas2r_ioctl_handler
    esas2r: fir error handling in do_fm_api
    scsi: add SPC-3 command definitions
    scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16
    scsi: remove scsi_driver owner field
    scsi: move scsi_dispatch_cmd to scsi_lib.c
    scsi: stop passing a gfp_mask argument down the command setup path
    scsi: remove scsi_next_command
    ...

    Linus Torvalds
     
  • When a hardware queue has no mapped software queues, it
    should not be scheduled. Otherwise a WARNING or OOPS
    can be triggered.

    The blk_mq_hw_queue_mapped() helper is introduced to fix
    the problem.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • * pm-runtime: (25 commits)
    i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c
    dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME
    MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    hwrandom / exynos / PM: Use CONFIG_PM in #ifdef
    block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
    USB / PM: Drop CONFIG_PM_RUNTIME from the USB core
    PM: Merge the SET*_RUNTIME_PM_OPS() macros
    PM / Kconfig: Do not select PM directly from Kconfig files
    PCI / PM: Drop CONFIG_PM_RUNTIME from the PCI core
    ...

    Rafael J. Wysocki
     

08 Dec, 2014

3 commits


04 Dec, 2014

1 commit

  • After commit b2b49ccbdd54 (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
    selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks
    depending on CONFIG_PM_RUNTIME may now be changed to depend on
    CONFIG_PM.

    Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core.

    Reviewed-by: Aaron Lu
    Acked-by: Jens Axboe
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

02 Dec, 2014

1 commit

  • bio integrity handling is broken on a system with LVM layered atop a
    DIF/DIX SCSI drive because device mapper clones the bio, modifies the
    clone, and sends the clone to the lower layers for processing.
    However, the clone bio has bi_vcnt == 0, which means that when the sd
    driver calls bio_integrity_process() to attach DIX data, the
    bio_for_each_segment_all() call (which uses bi_vcnt) returns immediately
    and random garbage is sent to the disk on a write. The disk of
    course returns an error.

    Therefore, teach bio_integrity_process() to use bio_for_each_segment()
    to iterate the bio_vecs, since the per-bio iterator tracks which
    bio_vecs are associated with that particular bio. The integrity
    handling code is effectively part of the "driver" (it's not the bio
    owner), so it must use the correct iterator function.
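
    The difference between the two iteration styles can be shown with a
    toy model. This Python sketch is not the kernel data structure: the
    Bio class, clone(), segments_all() and segments() are hypothetical
    names, segments are reduced to plain byte lengths, and partially
    consumed segments are ignored. It assumes, as described above, that a
    clone shares the segment array and carries its own iterator but has a
    segment count of zero.

```python
class Bio:
    def __init__(self, vecs, vcnt, idx, size):
        self.vecs = vecs  # shared segment array (byte length per segment)
        self.vcnt = vcnt  # segments owned by this bio (models bi_vcnt)
        self.idx = idx    # per-bio iterator: index of the current segment
        self.size = size  # per-bio iterator: bytes still to be processed


def clone(bio):
    """Share the segment array and copy the iterator; the count stays 0."""
    return Bio(bio.vecs, 0, bio.idx, bio.size)


def segments_all(bio):
    """Owner-style walk: trusts the segment count."""
    return bio.vecs[:bio.vcnt]


def segments(bio):
    """Iterator-style walk: follows the per-bio iterator instead."""
    out = []
    idx, remaining = bio.idx, bio.size
    while remaining > 0:
        length = min(bio.vecs[idx], remaining)
        out.append(length)
        remaining -= length
        idx += 1
    return out
```

    On a clone, segments_all() yields nothing while segments() still
    visits every segment, which is the mismatch the patch addresses.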

    v2: Fix a compiler warning about abandoned local variables. This
    patch supersedes "block: bio_integrity_process uses wrong bio_vec
    iterator". Patch applies against 3.18-rc6.

    Signed-off-by: Darrick J. Wong
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Darrick J. Wong
     

01 Dec, 2014

1 commit


25 Nov, 2014

3 commits


24 Nov, 2014

2 commits


20 Nov, 2014

1 commit

  • We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
    with a user passed in partno value. If we pass in 0x7fffffff, the
    new target in disk_expand_part_tbl() overflows the 'int' and we
    access beyond the end of ptbl->part[] and even write to it when we
    do the rcu_assign_pointer() to assign the new partition.
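
    The overflow is easy to reproduce with 32-bit arithmetic. The sketch
    below is a Python model, not the kernel code: to_int32() and
    expand_target() are hypothetical helpers, and rejecting a negative
    target with -EINVAL is an assumption about how such a guard might
    look.

```python
INT_MAX = 0x7FFFFFFF
EINVAL = 22


def to_int32(value):
    """Wrap an integer the way a 32-bit two's-complement int would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value > INT_MAX else value


def expand_target(partno):
    """Return the table size needed for partno, or -EINVAL on overflow."""
    target = to_int32(partno + 1)
    if target < 0:
        # partno + 1 no longer fits in an int: refuse instead of indexing.
        return -EINVAL
    return target
```

    Here expand_target(0x7fffffff) returns -EINVAL because the wrapped
    target is negative, while an ordinary partition number such as 7
    yields a target of 8.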

    Reported-by: David Ramos
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Nov, 2014

2 commits


12 Nov, 2014

5 commits

  • Currently SCSI piggybacks on the block layer to define the concept
    of a tagged command. But we want to be able to have block-level host-wide
    tags assigned even for untagged commands like the initial INQUIRY, so add
    a new SCSI-level flag for commands that are tagged at the SCSI level, so
    that even commands without that flag set can have tags assigned to them. Note
    that this is already the case for the blk-mq code path, and this just lets
    the old path catch up with it.

    We also set this flag based upon sdev->simple_tags instead of the block
    queue flag, so that it is entirely independent of the block layer tagging,
    and thus always correct even if a driver doesn't use block level tagging
    yet.

    Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and
    removing it forces them to look for the proper replacement.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Christie
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke

    Christoph Hellwig
     
  • The queuecommand() callback functions in SCSI low-level drivers
    need to know which hardware context has been selected by the
    block layer. Since this information is not available in the
    request structure, and since passing the hctx pointer directly to
    the queuecommand callback function would require modification of
    all SCSI LLDs, add a function to the block layer that allows
    querying the hardware context index.

    Signed-off-by: Bart Van Assche
    Acked-by: Jens Axboe
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig

    Bart Van Assche
     
  • For a cloned bio, bio->bi_vcnt can't be used at all, and we
    have to resort to bio_segments() to figure out how many
    segments there are in the bio.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • blk-mq is using preempt_disable/enable in order to ensure that the
    queue runners are placed on the right CPU. This does not work with
    the RT patches, because __blk_mq_run_hw_queue takes a non-raw
    spinlock within the preemption-disabled region. If there is contention
    on the lock, this violates the rules for preemption-disabled regions.

    While this should be easily fixable within the RT patches just by doing
    migrate_disable/enable, we can do better and document _why_ this
    particular region runs with disabled preemption. After the previous
    patch, it is trivial to switch it to get/put_cpu; the RT patches then
    can change it to get_cpu_light, which lets virtio-blk run under RT
    kernels.

    Cc: Jens Axboe
    Cc: Thomas Gleixner
    Reported-by: Clark Williams
    Tested-by: Clark Williams
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     
  • preempt_disable/enable surrounds every call to blk_mq_run_hw_queue,
    except the one in blk-flush.c. In fact that one is always asynchronous,
    and it does not need smp_processor_id().

    We can do the same for all other calls, avoiding preempt_disable when
    async is true. This avoids peppering blk-mq.c with preemption-disabled
    regions.

    Cc: Jens Axboe
    Cc: Thomas Gleixner
    Reported-by: Clark Williams
    Tested-by: Clark Williams
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     

11 Nov, 2014

1 commit


05 Nov, 2014

1 commit

  • q->mq_usage_counter is a percpu_ref which is killed and drained when
    the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit(),
    which involves freezing the queue, is invoked on all existing queues.
    Because percpu_ref killing and draining involve a RCU grace period,
    doing the above on one queue after another may take a long time if
    there are many queues on the system.

    This patch splits out initiation of freezing and waiting for its
    completion, and updates blk_mq_queue_reinit_notify() so that the
    queues are frozen in parallel instead of one after another. Note that
    freezing and unfreezing are moved from blk_mq_queue_reinit() to
    blk_mq_queue_reinit_notify().

    Signed-off-by: Tejun Heo
    Reported-by: Christian Borntraeger
    Tested-by: Christian Borntraeger
    Signed-off-by: Jens Axboe

    Tejun Heo
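
The benefit of splitting "start freezing" from "wait for freeze" can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `sketch_*` name is hypothetical, and a simple counter stands in for RCU grace periods; none of this is the kernel's blk-mq code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a request queue; not the kernel's struct. */
struct sketch_queue {
    unsigned long freeze_started; /* clock value when the freeze began */
    int frozen;
};

/* Assumption: one tick of this counter models one grace period. */
static unsigned long sketch_clock;

/* Non-blocking: only records when the freeze was initiated. */
static void sketch_freeze_start(struct sketch_queue *q)
{
    q->freeze_started = sketch_clock;
}

/* Blocking: completes one grace period after the freeze was started. */
static void sketch_freeze_wait(struct sketch_queue *q)
{
    unsigned long done = q->freeze_started + 1;

    if (sketch_clock < done)
        sketch_clock = done;
    q->frozen = 1;
}

/* One queue after another: every queue pays its own grace period. */
static unsigned long sketch_freeze_serial(struct sketch_queue *qs, size_t n)
{
    unsigned long t0 = sketch_clock;

    for (size_t i = 0; i < n; i++) {
        sketch_freeze_start(&qs[i]);
        sketch_freeze_wait(&qs[i]);
    }
    return sketch_clock - t0;
}

/* Start all freezes first, then wait: the grace periods overlap. */
static unsigned long sketch_freeze_parallel(struct sketch_queue *qs, size_t n)
{
    unsigned long t0 = sketch_clock;

    for (size_t i = 0; i < n; i++)
        sketch_freeze_start(&qs[i]);
    for (size_t i = 0; i < n; i++)
        sketch_freeze_wait(&qs[i]);
    return sketch_clock - t0;
}
```

In this model the serial variant costs one tick per queue, while the two-pass variant costs a single tick regardless of the number of queues.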
     

31 Oct, 2014

1 commit

  • Priority of a merged request is computed by ioprio_best(). If one of the
    requests has undefined priority (IOPRIO_CLASS_NONE) and another request
    has priority from IOPRIO_CLASS_BE, the function will return the
    undefined priority, which is wrong. Fix the function to return the
    priority of the request whose priority is defined.

    Fixes: d58cdfb89ce0c6bd5f81ae931a984ef298dbda20
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jan Kara
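
The intended behaviour can be sketched in userspace C as follows. This is a minimal illustration, not the patch itself: `sketch_ioprio_best()` is a hypothetical name, and the macro encoding (class in the upper bits, level in the lower 13 bits) is an assumption made for the sketch.

```c
/* Assumed encoding for this sketch: class in the upper bits, level below. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_CLASS(p) ((p) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_VALUE(c, d) (((c) << IOPRIO_CLASS_SHIFT) | (d))

enum {
    IOPRIO_CLASS_NONE, /* undefined priority */
    IOPRIO_CLASS_RT,
    IOPRIO_CLASS_BE,
    IOPRIO_CLASS_IDLE,
};

/*
 * Hypothetical helper: pick the priority for a merged request.
 * An undefined priority never wins over a defined one.
 */
static unsigned short sketch_ioprio_best(unsigned short aprio,
                                         unsigned short bprio)
{
    unsigned short aclass = IOPRIO_PRIO_CLASS(aprio);
    unsigned short bclass = IOPRIO_PRIO_CLASS(bprio);

    if (aclass == IOPRIO_CLASS_NONE)
        return bprio;
    if (bclass == IOPRIO_CLASS_NONE)
        return aprio;

    /* Same class: the numerically lower level is the better one. */
    if (aclass == bclass)
        return aprio < bprio ? aprio : bprio;

    /* Different classes: the numerically lower class is the better one. */
    return aclass < bclass ? aprio : bprio;
}
```

Without the two early returns, the "lower class wins" comparison would pick IOPRIO_CLASS_NONE (value 0) over any defined class, which is the bug the message describes.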
     

30 Oct, 2014

2 commits

  • Drivers can now tell blk-mq if they take advantage of the deferred
    issue through 'last' or not. If they do, don't do queue-direct
    for sync IO. This is a preparation patch for the nvme conversion.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since we have the notion of a 'last' request in a chain, we can use
    this to have the hardware optimize the issuing of requests. Add
    a list_head parameter to queue_rq that the driver can use to
    temporarily store hw commands for issue when 'last' is true. If we
    are doing a chain of requests, pass in a NULL list for the first
    request to force issue of that immediately, then batch the remainder
    for deferred issue until the last request has been sent.

    Instead of adding yet another argument to the hot ->queue_rq path,
    encapsulate the passed arguments in a blk_mq_queue_data structure.
    This is passed as a constant, and has been tested as faster than
    passing 4 (or even 3) args through ->queue_rq. Update drivers for
    the new ->queue_rq() prototype. There are no functional changes
    in this patch for drivers - if they don't use the passed in list,
    then they will just queue requests individually like before.

    Signed-off-by: Jens Axboe

    Jens Axboe
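
The batching idea described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: all `sketch_*` types and functions are hypothetical stand-ins, a counter replaces the list_head, and "doorbells" merely counts how often the simulated hardware is notified.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_request { int tag; };

/* Stand-in for the staging list a driver could keep hw commands on. */
struct sketch_batch { int staged; };

/* Arguments bundled into one structure and passed by const pointer. */
struct sketch_queue_data {
    struct sketch_request *rq;
    struct sketch_batch *list; /* NULL: issue this request right away */
    bool last;                 /* true for the final request of a chain */
};

struct sketch_hw {
    int issued;    /* commands handed to the simulated device */
    int doorbells; /* how many times the device was notified */
};

/* Hypothetical driver hook: stage commands until 'last' is seen. */
static int sketch_queue_rq(struct sketch_hw *hw,
                           const struct sketch_queue_data *bd)
{
    if (!bd->list) {
        /* No list passed in: issue immediately. */
        hw->issued++;
        hw->doorbells++;
        return 0;
    }

    bd->list->staged++;
    if (bd->last) {
        /* Flush everything staged so far with a single notification. */
        hw->issued += bd->list->staged;
        bd->list->staged = 0;
        hw->doorbells++;
    }
    return 0;
}

/* First request goes out immediately, the remainder is batched. */
static void sketch_dispatch(struct sketch_hw *hw,
                            struct sketch_request *rqs, size_t n)
{
    struct sketch_batch batch = { 0 };

    for (size_t i = 0; i < n; i++) {
        struct sketch_queue_data bd = {
            .rq = &rqs[i],
            .list = (i == 0) ? NULL : &batch,
            .last = (i + 1 == n),
        };
        sketch_queue_rq(hw, &bd);
    }
}
```

For a chain of four requests this model notifies the device twice: once for the first request and once for the remaining three together.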
     

24 Oct, 2014

1 commit

  • While compiling, the integer err showed up as a set-but-unused
    variable. elevator_init_fn can be cfq_init_queue, deadline_init_queue,
    or noop_init_queue. All three of these functions return -ENOMEM if
    they fail to allocate the queue, so we should return the error code
    rather than always returning 0.

    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Jens Axboe

    Sudip Mukherjee
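
The shape of the problem and of the fix can be sketched as follows. The `sketch_*` names are hypothetical and the init callbacks are trivial stand-ins, not the real elevator init functions.

```c
#include <errno.h>

/* Stand-in for an elevator init callback. */
typedef int (*sketch_init_fn)(void);

static int sketch_init_ok(void)    { return 0; }
static int sketch_init_nomem(void) { return -ENOMEM; }

/* Pre-fix shape: err is set but never used, so failures are hidden. */
static int sketch_elevator_init_buggy(sketch_init_fn init)
{
    int err = init();

    (void)err; /* silences the warning, but the failure is still lost */
    return 0;
}

/* Post-fix shape: the callback's error code reaches the caller. */
static int sketch_elevator_init_fixed(sketch_init_fn init)
{
    int err = init();

    if (err)
        return err;
    return 0;
}
```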
     

23 Oct, 2014

1 commit

  • When sg_scsi_ioctl() fails to prepare the request for submission in
    blk_rq_map_kern(), we jump to a label where we just end up copying
    a (luckily zeroed-out) kernel buffer to userspace instead of reporting
    an error. Fix the problem by jumping to the right label.

    CC: Jens Axboe
    CC: linux-scsi@vger.kernel.org
    CC: stable@vger.kernel.org
    Coverity-id: 1226871
    Signed-off-by: Jan Kara

    Fixed up the, now unused, out label.

    Signed-off-by: Jens Axboe

    Jan Kara
     

22 Oct, 2014

1 commit

  • The problem was introduced by commit 764f612c6c3c231b ("blk-merge:
    don't compute bi_phys_segments from bi_vcnt for cloned bio"),
    and merge is needed if the number of current segments isn't less
    than max segments.

    Strictly speaking, bio->bi_vcnt shouldn't be used here since
    it may not be accurate for either a cloned bio or the bio it was
    cloned from, but bio_segments() is a bit expensive, and bi_vcnt is
    still an upper bound on the segment count, so the approach should work.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei