
04 Mar, 2017

1 commit

  • Pull block layer fixes from Jens Axboe:
    "A collection of fixes for this merge window, either fixes for existing
    issues, or parts that were waiting for acks to come in. This pull
    request contains:

    - Allocation of nvme queues on the right node from Shaohua.

    This was ready long before the merge window, but waiting on an ack
    from Bjorn on the PCI bit. Now that we have that, the three patches
    can go in.

    - Two fixes for blk-mq-sched with nvmeof, which uses hctx specific
    request allocations. This caused an oops. One part from Sagi, one
    part from Omar.

    - A loop partition scan deadlock fix from Omar, fixing a regression
    in this merge window.

    - A three-patch series from Keith, closing up a hole on clearing out
    requests on shutdown/resume.

    - A stable fix for nbd from Josef, fixing a leak of sockets.

    - Two fixes for a regression in this window from Jan, fixing a
    problem with one of his earlier patches dealing with queue vs bdi
    life times.

    - A fix for a regression with virtio-blk, causing an IO stall if
    scheduling is used. From me.

    - A fix for an io context lock ordering problem. From me"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: Move bdi_unregister() to del_gendisk()
    blk-mq: ensure that bd->last is always set correctly
    block: don't call ioc_exit_icq() with the queue lock held for blk-mq
    block: Initialize bd_bdi on inode initialization
    loop: fix LO_FLAGS_PARTSCAN hang
    nvme: Complete all stuck requests
    blk-mq: Provide freeze queue timeout
    blk-mq: Export blk_mq_freeze_queue_wait
    nbd: stop leaking sockets
    blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
    blk-mq: kill blk_mq_set_alloc_data()
    blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
    blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
    nvme: allocate nvme_queue in correct node
    PCI: add an API to get node from vector
    blk-mq: allocate blk_mq_tags and requests in correct node

    Linus Torvalds
     

02 Mar, 2017

3 commits

  • If the nvme driver is shutting down its controller, the driver will not
    start the queues up again, preventing blk-mq's hot CPU notifier from
    making forward progress.

    To fix that, this patch starts a request_queue freeze when the driver
    resets a controller so that no new requests may enter. The driver will
    wait for the freeze to complete after the IO queues are restarted, to
    ensure the queue reference can be reinitialized when nvme unfreezes
    the queues.

    If the driver is doing a safe shutdown, the driver will wait for the
    controller to successfully complete all inflight requests so that we
    don't unnecessarily fail them. Once the controller has been disabled,
    the queues will be restarted to force remaining entered requests to end
    in failure so that blk-mq's hot cpu notifier may progress.
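    The freeze/drain idea described above can be modeled in a few lines of
    user-space C. This is a toy, single-threaded sketch of the semantics
    only; the names (toy_queue_enter, toy_freeze_start, and so on) are
    illustrative and are not the kernel's blk-mq API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a queue with a usage count of entered requests and a
 * frozen flag.  Once the freeze starts, no new request may enter;
 * the freeze completes when every entered request has exited. */
struct toy_queue {
	int usage;
	bool frozen;
};

static bool toy_queue_enter(struct toy_queue *q)
{
	if (q->frozen)
		return false;	/* new requests are rejected */
	q->usage++;
	return true;
}

static void toy_queue_exit(struct toy_queue *q)
{
	q->usage--;
}

static void toy_freeze_start(struct toy_queue *q)
{
	q->frozen = true;
}

/* What the driver's "wait for frozen" would poll for. */
static bool toy_freeze_done(const struct toy_queue *q)
{
	return q->frozen && q->usage == 0;
}
```

    On a safe shutdown the driver lets the controller finish the entered
    requests; on a forced one it restarts the queues so they end in
    failure. Either way the usage count reaches zero and the freeze
    completes.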

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • nvme_queue is (mostly) a per-cpu queue. Allocate it on the node where
    blk-mq will use it.

    Signed-off-by: Shaohua Li
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • We don't actually need the full rculist.h header in sched.h anymore;
    we will be able to include the smaller rcupdate.h header instead.

    But first update code that relied on the implicit header inclusion.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     


26 Feb, 2017

1 commit

  • Pull rdma DMA mapping updates from Doug Ledford:
    "Drop IB DMA mapping code and use core DMA code instead.

    Bart Van Assche noted that the ib DMA mapping code was significantly
    similar enough to the core DMA mapping code that with a few changes it
    was possible to remove the IB DMA mapping code entirely and switch the
    RDMA stack to use the core DMA mapping code.

    This resulted in a nice set of cleanups, but touched the entire tree
    and has been kept separate for that reason."

    * tag 'for-next-dma_ops' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (37 commits)
    IB/rxe, IB/rdmavt: Use dma_virt_ops instead of duplicating it
    IB/core: Remove ib_device.dma_device
    nvme-rdma: Switch from dma_device to dev.parent
    RDS: net: Switch from dma_device to dev.parent
    IB/srpt: Modify a debug statement
    IB/srp: Switch from dma_device to dev.parent
    IB/iser: Switch from dma_device to dev.parent
    IB/IPoIB: Switch from dma_device to dev.parent
    IB/rxe: Switch from dma_device to dev.parent
    IB/vmw_pvrdma: Switch from dma_device to dev.parent
    IB/usnic: Switch from dma_device to dev.parent
    IB/qib: Switch from dma_device to dev.parent
    IB/qedr: Switch from dma_device to dev.parent
    IB/ocrdma: Switch from dma_device to dev.parent
    IB/nes: Remove a superfluous assignment statement
    IB/mthca: Switch from dma_device to dev.parent
    IB/mlx5: Switch from dma_device to dev.parent
    IB/mlx4: Switch from dma_device to dev.parent
    IB/i40iw: Remove a superfluous assignment statement
    IB/hns: Switch from dma_device to dev.parent
    ...

    Linus Torvalds
     


23 Feb, 2017

19 commits

  • Add support for detecting the NVMe controller found in the
    following recent MacBooks:
    - Retina MacBook 2016 (MacBook9,1)
    - 13" MacBook Pro 2016 without Touch Bar (MacBook13,1)
    - 13" MacBook Pro 2016 with Touch Bar (MacBook13,2)

    Signed-off-by: Daniel Roschka
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Daniel Roschka
     
  • This will enable the user to control the specific interface used for
    connection establishment in case the host has more than one interface
    on the same subnet.
    E.g.:
    Host interfaces configured as:
    - ib0 1.1.1.1/16
    - ib1 1.1.1.2/16

    Target interfaces configured as:
    - ib0 1.1.1.3/16 (listener interface)
    - ib1 1.1.1.4/16

    The following connect command will go through host iface ib0 (default):
    nvme connect -t rdma -n testsubsystem -a 1.1.1.3 -s 1023

    but the following command will go through host iface ib1:
    nvme connect -t rdma -n testsubsystem -a 1.1.1.3 -s 1023 -w 1.1.1.2

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Parav Pandit
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • According to the preceding goto, it is likely that 'out_destroy_sq'
    was expected here.

    Signed-off-by: Christophe JAILLET
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christophe JAILLET
     
  • Also remove redundant debug prints.

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Parav Pandit
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • This will enable its usage for the nvme rdma target.
    Also move from a lookup array to a switch statement.

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Parav Pandit
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • Discovery controllers don't set these values; they are in reserved
    areas of the Identify Controller data structure.

    Given that the command completed, the minimal capsule sizes are
    supported, so there is no need to check the nqn to detect discovery
    controllers and special-case the validations.

    Signed-off-by: James Smart
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    James Smart
     
  • This driver previously required a special check for IO submitted to
    nvme IO queues that are temporarily suspended. That is no longer
    necessary since blk-mq provides a quiesce, so any IO that actually gets
    submitted to such a queue must be ended, since the queue isn't going to
    start back up.

    This is fixing a condition where we have fewer IO queues after a
    controller reset. This may happen if the number of CPUs has changed,
    or a controller firmware update changed the queue count, for example.

    While it may be possible to complete the IO on a different queue, the
    block layer does not provide a way to resubmit a request on a different
    hardware context once the request has entered the queue. We don't want
    these requests to be stuck indefinitely either, so ending them in error
    is our only option at the moment.

    Signed-off-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • If a namespace has already been marked dead, we don't want to kick the
    request_queue again since we may have just freed it from another thread.

    Signed-off-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • If the device is not present, the driver should disable the queues
    immediately. Prior to this, the driver was relying on the watchdog timer
    to kill the queues if requests were outstanding to the device, and that
    just delays removal up to one second.

    Signed-off-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • NVMe devices can advertise multiple power states. These states can
    be either "operational" (the device is fully functional but possibly
    slow) or "non-operational" (the device is asleep until woken up).
    Some devices can automatically enter a non-operational state when
    idle for a specified amount of time and then automatically wake back
    up when needed.

    The hardware configuration is a table. For each state, an entry in
    the table indicates the next deeper non-operational state, if any,
    to autonomously transition to and the idle time required before
    transitioning.

    This patch teaches the driver to program APST so that each successive
    non-operational state will be entered after an idle time equal to 100%
    of the total latency (entry plus exit) associated with that state.
    The maximum acceptable latency is controlled using dev_pm_qos
    (e.g. power/pm_qos_latency_tolerance_us in sysfs); non-operational
    states with total latency greater than this value will not be used.
    As a special case, setting the latency tolerance to 0 will disable
    APST entirely. On hardware without APST support, the sysfs file will
    not be exposed.

    The latency tolerance for newly-probed devices is set by the module
    parameter nvme_core.default_ps_max_latency_us.

    In theory, the device can expose a "default" APST table, but this
    doesn't seem to function correctly on my device (Samsung 950), nor
    does it seem particularly useful. There is also an optional
    mechanism by which a configuration can be "saved" so it will be
    automatically loaded on reset. This can be configured from
    userspace, but it doesn't seem useful to support in the driver.

    On my laptop, enabling APST seems to save nearly 1W.

    The hardware tables can be decoded in userspace with nvme-cli.
    'nvme id-ctrl /dev/nvmeN' will show the power state table and
    'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
    configuration.

    This feature is quirked off on a known-buggy Samsung device.
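    The selection rule above can be sketched in a few lines of C: a
    non-operational state is usable only if its total (entry plus exit)
    latency fits under the configured tolerance, and its idle timeout is
    100% of that total. The struct layout and names below are illustrative,
    not the driver's data structures or the NVMe identify layout.

```c
#include <assert.h>

/* Illustrative power-state entry; not the NVMe identify layout. */
struct toy_ps {
	int non_operational;	/* 1 if this is a sleep state */
	int entry_lat_us;
	int exit_lat_us;
};

/* Return the idle time (in us) before autonomously entering this state,
 * or -1 if the state must not be used: it is operational, its total
 * latency exceeds the tolerance, or the tolerance is 0 (which disables
 * APST entirely, matching the special case described above). */
static int toy_apst_idle_time_us(const struct toy_ps *p, int max_lat_us)
{
	int total = p->entry_lat_us + p->exit_lat_us;

	if (max_lat_us == 0 || !p->non_operational || total > max_lat_us)
		return -1;
	/* Idle time is 100% of the state's total transition latency. */
	return total;
}
```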

    Signed-off-by: Andy Lutomirski
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Andy Lutomirski
     
  • Currently, all NVMe quirks are based on PCI IDs. Add a mechanism to
    define quirks based on identify_ctrl's vendor id, model number,
    and/or firmware revision.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Andy Lutomirski
     
  • nvmf_create_ctrl() relies on the presence of a create_ctrl callback in
    the registered nvmf_transport_ops, so make nvmf_register_transport()
    require one.

    Update the available call-sites as well to reflect these changes.

    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • This patch defines the CNS field as an 8-bit field and avoids
    cpu_to/from_le conversions.
    Also initialize nvme_command cns value explicitly to NVME_ID_CNS_NS
    for readability (don't rely on the fact that NVME_ID_CNS_NS = 0).

    Reviewed-by: Max Gurtovoy
    Signed-off-by: Parav Pandit
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Parav Pandit
     
  • Reviewed-by: Parav Pandit
    Signed-off-by: Max Gurtovoy
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • No need to dereference req twice to get the cmd when we already
    have it stored in a local variable.

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Parav Pandit
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • This makes state machine transitions easier to debug and test.

    Signed-off-by: Sagi Grimberg
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • We usually log the cntlid, which is confusing in case we have
    multiple subsystems, each with its own cntlid ida. Instead, make the
    cntlid ida globally unique and log the initial association.

    Signed-off-by: Sagi Grimberg
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • Clean up abort flag processing in fcp_op_done. The references were
    unnecessary.

    Signed-off-by: James Smart
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    James Smart
     
  • Trivial fix to a spelling mistake in a pr_err message.

    Signed-off-by: Colin Ian King
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Colin Ian King
     


15 Feb, 2017

1 commit

  • When CONFIG_KASAN is enabled, compilation fails:

    block/sed-opal.c: In function 'sed_ioctl':
    block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

    Move all of the ioctl structures off the stack and dynamically
    allocate them using _IOC_SIZE().
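    The pattern described (heap-allocating the ioctl argument, sized by
    the length encoded in the ioctl number, instead of a large on-stack
    union) can be sketched in user space. The shift and mask below match
    the common Linux _IOC encoding (14 size bits starting at bit 16), but
    the TOY_IOC_SIZE stand-in and the helper are illustrative, not the
    sed-opal code.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the kernel's _IOC_SIZE(): on most architectures the
 * Linux ioctl number encodes the argument size in 14 bits at shift 16. */
#define TOY_IOC_SIZE(cmd)	(((cmd) >> 16) & 0x3fff)

/* Heap-allocate a zeroed buffer big enough for the ioctl argument,
 * instead of declaring every possible argument struct on the stack. */
static void *toy_ioctl_arg_alloc(unsigned int cmd)
{
	return calloc(1, TOY_IOC_SIZE(cmd));
}
```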

    Fixes: 455a7b238cd6 ("block: Add Sed-opal library")

    Reported-by: Arnd Bergmann
    Signed-off-by: Scott Bauer
    Signed-off-by: Jens Axboe

    Scott Bauer
     


01 Feb, 2017

2 commits

  • Instead of keeping two levels of indirection for requests types, fold it
    all into the operations. The little caveat here is that previously
    cmd_type only applied to struct request, while the request and bio op
    fields were set to plain REQ_OP_READ/WRITE even for passthrough
    operations.

    Instead, this patch adds new REQ_OP_* values for SCSI passthrough and
    driver private requests, although it has to add two for each so that
    we can communicate the data in/out nature of the request.
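    The in/out split can be pictured as a small enum plus a direction
    helper. This is an illustrative sketch of the idea, not the actual
    blk_types.h definitions; the names are modeled on the REQ_OP_*
    convention described above.

```c
#include <assert.h>

/* Illustrative sketch: one op per request type and data direction,
 * so the op itself carries the in/out nature of the request. */
enum toy_req_op {
	TOY_OP_READ,
	TOY_OP_WRITE,
	TOY_OP_SCSI_IN,		/* SCSI passthrough, device-to-host data */
	TOY_OP_SCSI_OUT,	/* SCSI passthrough, host-to-device data */
	TOY_OP_DRV_IN,		/* driver private, device-to-host data */
	TOY_OP_DRV_OUT,		/* driver private, host-to-device data */
};

/* With two ops per passthrough type, direction is a property of the op. */
static int toy_op_data_in(enum toy_req_op op)
{
	return op == TOY_OP_READ || op == TOY_OP_SCSI_IN ||
	       op == TOY_OP_DRV_IN;
}

/* Everything past the plain read/write ops is a passthrough request. */
static int toy_op_is_passthrough(enum toy_req_op op)
{
	return op >= TOY_OP_SCSI_IN;
}
```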

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This can be used to check for fs vs non-fs requests, and it basically
    removes all knowledge of BLOCK_PC specifics from the block layer, as
    well as preparing for the removal of the cmd_type field in struct
    request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig