10 Oct, 2015

6 commits

  • The buildbot complains about this even if it doesn't generate
    a build warning. But it's an easy fix, so here we go:

    Reported-by: kbuild test robot
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently all NVMe command and completion structures are exposed to userspace
    through the uapi version of nvme.h. They are not an ABI between the kernel
    and userspace, and will change in C-incompatible ways for future versions of
    the spec. Move them to the kernel version of the file and rename the uapi
    header to nvme_ioctl.h so that userspace can easily detect the presence of
    the new clean header. Nvme-cli already carries a local copy of the header,
    so it won't be affected by this move.

    Signed-off-by: Christoph Hellwig
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Add a new drivers/block/nvme.h which contains all the driver internal
    interface.

    Signed-off-by: Christoph Hellwig
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Releasing IO queues and disks was done in a work queue outside the
    controller resume context to delete namespaces if the controller failed
    after a resume from suspend. This is unnecessary since we can resume
    a device asynchronously.

    This patch makes resume use probe_work so it can directly remove
    namespaces if the device is manageable but not IO capable. Since
    deleting disks was the only reason we had the convoluted "reset_workfn",
    this patch removes that unnecessary indirection.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Dynamic namespace attachment means the namespace may be removed at any
    time, so the namespace reference count can not be tied to the device
    reference count. This fixes a NULL dereference if an opened namespace
    is detached from a controller.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

19 Aug, 2015

2 commits

  • Controllers can perform optional subsystem resets as introduced in NVMe
    1.1. This patch adds an IOCTL to trigger the subsystem reset by writing
    "NVMe" to the NSSR register.

    Signed-off-by: Jon Derrick
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jon Derrick
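
As a rough illustration of the register write described above, the sketch below packs the ASCII string "NVMe" into the 32-bit value 0x4E564D65 and stores it at the NSSR offset (0x20). The helper names and the plain array standing in for the mapped register file are assumptions made for this example, not the driver's code.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four ASCII bytes into a 32-bit value, first character in the
 * most significant byte. */
static uint32_t nssr_magic(const char tag[4])
{
    return ((uint32_t)(unsigned char)tag[0] << 24) |
           ((uint32_t)(unsigned char)tag[1] << 16) |
           ((uint32_t)(unsigned char)tag[2] << 8)  |
            (uint32_t)(unsigned char)tag[3];
}

/* NSSR lives at byte offset 0x20 of the controller registers. */
#define NSSR_WORD (0x20 / 4)

/* Hypothetical helper: `regs` stands in for the mapped register file,
 * viewed as an array of 32-bit words. */
static void subsystem_reset(volatile uint32_t *regs)
{
    regs[NSSR_WORD] = nssr_magic("NVMe");
}
```

A real driver would use an MMIO accessor such as writel() rather than a plain store.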
     
  • Controllers part of an NVMe subsystem may be reset by any other controller
    in the subsystem. If the device is capable of subsystem resets, this
    patch adds detection for such events and performs appropriate controller
    initialization upon subsystem reset detection.

    The register bit is a RW1C type, so the driver needs to write a 1 to the
    status bit to clear the subsystem reset occurred bit during initialization.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
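
The RW1C ("write 1 to clear") behaviour mentioned above can be modelled in a few lines. This is only a sketch: rw1c_write() imitates what the hardware does on a write, and ack_subsystem_reset() is a hypothetical driver-side helper. The bit position (CSTS bit 4) follows the NVMe register layout.

```c
#include <assert.h>
#include <stdint.h>

#define CSTS_NSSRO (1u << 4)   /* subsystem reset occurred, CSTS bit 4 */

/* Model of the hardware side: for bits in rw1c_mask, writing 1 clears
 * the bit and writing 0 leaves it alone. Other bits are treated as
 * read-only here to keep the model small. */
static uint32_t rw1c_write(uint32_t current, uint32_t written,
                           uint32_t rw1c_mask)
{
    return current & ~(written & rw1c_mask);
}

/* Hypothetical driver-side helper: returns 1 if a subsystem reset was
 * pending and has now been acknowledged, 0 otherwise. */
static int ack_subsystem_reset(uint32_t *csts)
{
    if (!(*csts & CSTS_NSSRO))
        return 0;
    *csts = rw1c_write(*csts, CSTS_NSSRO, CSTS_NSSRO);
    return 1;
}
```

Writing 0 to the bit is a no-op, which is why the driver has to write a 1 explicitly rather than clearing the bit with a read-modify-write.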
     

21 Jul, 2015

1 commit

  • Some controllers have a controller-side memory buffer available for use
    for submissions, completions, lists, or data.

    If a CMB is available, the entire CMB will be ioremapped and it will
    attempt to map the IO SQes onto the CMB. The queues will be shrunk as
    needed. The CMB will not be used if the queue depth is shrunk below some
    threshold where it may have reduced performance over a larger queue
    in system memory.

    Signed-off-by: Jon Derrick
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jon Derrick
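
The queue shrinking described above might look roughly like the following. The 64-byte submission queue entry size comes from the NVMe specification. The cut-off of 64 entries, the per-queue page rounding, and the function name are assumptions chosen for illustration and may not match the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SQE_BYTES      64u   /* size of one NVMe submission queue entry */
#define MIN_CMB_DEPTH  64u   /* hypothetical cut-off, for illustration only */

/* Returns the queue depth to use when the IO submission queues are
 * placed in the CMB, or 0 if the CMB should be skipped and the queues
 * left in system memory. */
static uint32_t cmb_queue_depth(uint64_t cmb_bytes, uint32_t nr_queues,
                                uint32_t wanted_depth, uint32_t page_bytes)
{
    uint64_t per_queue, depth;

    assert(nr_queues > 0 && page_bytes > 0);

    per_queue = cmb_bytes / nr_queues;
    per_queue -= per_queue % page_bytes;   /* round down to whole pages */

    depth = per_queue / SQE_BYTES;
    if (depth > wanted_depth)
        depth = wanted_depth;              /* never grow past the request */
    if (depth < MIN_CMB_DEPTH)
        return 0;                          /* too shallow to be worth it */
    return (uint32_t)depth;
}
```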
     

06 Jun, 2015

1 commit

  • Namespaces may be dynamically allocated and deleted or attached and
    detached. This has the driver rescan the device for namespace changes
    after each device reset or namespace change asynchronous event.

    There could potentially be many detached namespaces that we don't want
    polluting /dev/ with unusable block handles, so this will delete disks
    if the namespace is not active as indicated by the response from identify
    namespace. This also skips adding the disk if no capacity is provisioned
    to the namespace in the first place.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

22 May, 2015

3 commits

  • Use block layer queues with an internal cmd_type to submit internally
    generated NVMe commands. This both simplifies the code a lot and allows
    for a better structure. For example, the LightNVM code can now construct
    commands without knowing the details of the underlying I/O descriptors.
    A future NVMe over network target could also inject commands, and the
    SCSI translation and ioctl code could be reused for such a beast.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Most users want the generic device, so store that in struct nvme_dev
    instead of the pci_dev. This also happens to be a nice step towards
    making some code reusable for non-PCI transports.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Note that we keep the unused timeout argument, but allow callers to
    pass 0 instead of a timeout if they want the default. This will allow
    adding a timeout to the pass through path later on.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Apr, 2015

1 commit


20 Feb, 2015

5 commits

  • The driver has to end unreturned commands at some point even if the
    controller has not provided a completion. The driver tried to be safe by
    deleting IO queues prior to ending all unreturned commands. That should
    cause the controller to internally abort inflight commands, but IO queue
    deletion request does not have to be successful, so all bets are off. We
    still have to make progress, so to be extra safe, this patch doesn't
    clear a queue to release the dma mapping for a command until after the
    pci device has been disabled.

    This patch removes the special handling during device initialization
    so controller recovery can be done all the time. This is possible since
    initialization is not inlined with pci probe anymore.

    Reported-by: Nilish Choudhury
    Signed-off-by: Keith Busch

    Keith Busch
     
  • This performs the longest parts of nvme device probe in scheduled work.
    This speeds up probe significantly when multiple devices are in use.

    Signed-off-by: Keith Busch

    Keith Busch
     
  • This creates a new class type for nvme devices to register their
    management character devices with. This is so we do not rely on miscdev
    to provide enough minors for as many nvme devices as some people plan to
    use. The previous limit was approximately 60 NVMe controllers, depending
    on the platform and kernel. Now the limit is 1M, which ought to be enough
    for anybody.

    Since we have a new device class, it makes sense to attach the block
    devices under this as well, so part of this patch moves the management
    handle initialization prior to the namespaces discovery.

    Signed-off-by: Keith Busch

    Keith Busch
     
  • The original translation created collisions on Inquiry VPD 83 for many
    existing devices. Newer specifications provide other ways to translate
    that, depending on the device's version, can be used to create unique
    identifiers.

    Version 1.1 provides an EUI64 field that uniquely identifies each
    namespace, and 1.2 added the longer NGUID field for the same reason.
    Both follow the IEEE EUI format and readily translate to the SCSI device
    identification EUI designator type 2h. For devices implementing either,
    the translation will use this type, defaulting to the EUI64 8-byte type if
    implemented, then NGUID's 16-byte version if not. If neither is provided,
    the 1.0 translation is used, and is updated to use the SCSI String format
    to guarantee a unique identifier.

    Knowing when to use the new fields depends on the nvme controller's
    revision. The NVME_VS macro was not decoding this correctly, so that is
    fixed in this patch and moved to a more appropriate place.

    Since the Identify Namespace structure required an update for the NGUID
    field, this patch adds the remaining new 1.2 fields to the structure.

    Signed-off-by: Keith Busch

    Keith Busch
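
A sketch of the selection order described above. The version encoding (major in bits 31:16, minor in bits 15:8) follows the NVMe version register layout. The enum, the helper names, and the rule that an all-zero field counts as "not implemented" are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Version register layout: major in bits 31:16, minor in bits 15:8. */
#define NVME_VS(major, minor) \
    (((uint32_t)(major) << 16) | ((uint32_t)(minor) << 8))

/* Hypothetical labels for the three translations. */
enum designator { DESIG_EUI64, DESIG_NGUID, DESIG_STRING };

/* Assumption: a field that is all zeroes is treated as not implemented. */
static int all_zero(const uint8_t *p, size_t n)
{
    while (n--)
        if (*p++)
            return 0;
    return 1;
}

/* Prefer the 8-byte EUI64 (1.1 and later), then the 16-byte NGUID
 * (1.2 and later), and fall back to the string format otherwise. */
static enum designator pick_designator(uint32_t vs, const uint8_t eui64[8],
                                       const uint8_t nguid[16])
{
    if (vs >= NVME_VS(1, 1) && !all_zero(eui64, 8))
        return DESIG_EUI64;
    if (vs >= NVME_VS(1, 2) && !all_zero(nguid, 16))
        return DESIG_NGUID;
    return DESIG_STRING;
}
```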
     
  • Adds support for NVMe metadata formats and exposes block devices for
    all namespaces regardless of their format. Namespace formats that are
    unusable will have disk capacity set to 0, but a handle to the block
    device is created to simplify device management. A namespace is not
    usable when the format requires the host to interleave block data and
    metadata in a single buffer, has no provisioned storage, or has metadata
    but failed to register with blk integrity.

    The namespace has to be scanned in two phases to support separate
    metadata formats. The first establishes the sector size and capacity
    prior to invoking add_disk. If metadata is required, the capacity will
    be temporarily set to 0 until it can be revalidated and registered with
    the integrity extensions after add_disk completes.

    The driver relies on the integrity extensions to provide the metadata
    buffer. NVMe requires this be a single physically contiguous region,
    so only one integrity segment is allowed per command. If the metadata
    is used for T10 PI, the driver provides mappings to save and restore
    the reftag physical block translation. The driver provides no-op
    functions for generate and verify if metadata is not used for protection
    information. This way the setup is always provided by the block layer.

    If a request does not supply a required metadata buffer, the command
    is failed with bad address. This could only happen if a user manually
    disables verify/generate on such a disk. The only exception to where
    this is okay is if the controller is capable of stripping/generating
    the metadata, which is possible on some types of formats.

    The metadata scatter gather list now occupies the spot in the nvme_iod
    that used to be used to link retryable IOD's, but we don't do that
    anymore, so the field was unused.

    Signed-off-by: Keith Busch

    Keith Busch
     

30 Jan, 2015

1 commit

  • Currently we allocate an nvme_iod for each IO, which holds the
    sg list, prps, and other IO related info. Set a threshold of
    2 pages and/or 8KB of data, below which we can just embed this
    in the per-command pdu in blk-mq. For any IO at or below
    NVME_INT_PAGES and NVME_INT_BYTES, we save a kmalloc and kfree.

    For higher IOPS, this saves up to 1% of CPU time.

    Signed-off-by: Jens Axboe
    Reviewed-by: Keith Busch

    Jens Axboe
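
The threshold check described above could be sketched like this. NVME_INT_PAGES and NVME_INT_BYTES are named in the commit message. The fixed 4096-byte page size and the function name are assumptions for the example.

```c
#include <assert.h>

#define NVME_INT_PAGES  2
#define PAGE_BYTES      4096u   /* assumed page size for this sketch */
#define NVME_INT_BYTES  (NVME_INT_PAGES * PAGE_BYTES)

/* Returns 1 when the IO is small enough to use the iod embedded in the
 * per-command pdu, 0 when a separate allocation is still needed. */
static int iod_fits_inline(unsigned nr_segments, unsigned nbytes)
{
    return nr_segments <= NVME_INT_PAGES && nbytes <= NVME_INT_BYTES;
}
```

IOs that pass the check skip the kmalloc/kfree pair on the submission and completion paths.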
     

05 Nov, 2014

3 commits

  • This converts the NVMe driver to a blk-mq request-based driver.

    The NVMe driver is currently bio-based and implements queue logic within
    itself. By using blk-mq, a lot of these responsibilities can be moved
    and simplified.

    The patch is divided into the following blocks:

    * Per-command data and cmdid have been moved into the struct request
    field. The cmdid_data can be retrieved using blk_mq_rq_to_pdu() and id
    maintenance is now handled by blk-mq through the rq->tag field.

    * The logic for splitting bio's has been moved into the blk-mq layer.
    The driver instead notifies the block layer about limited gap support in
    SG lists.

    * Timeouts are detected by blk-mq and handled within nvme_timeout().
    This includes both abort handling and command cancellation.

    * Assignment of nvme queues to CPUs is replaced with the blk-mq
    version. The current blk-mq strategy is to assign the number of
    mapped queues and CPUs to provide synergy, while the nvme driver
    assigns as many nvme hw queues as possible. This can be implemented in
    blk-mq if needed.

    * NVMe queues are merged with the tags structure of blk-mq.

    * blk-mq takes care of setup/teardown of nvme queues and guards invalid
    accesses. Therefore, RCU-usage for nvme queues can be removed.

    * IO tracing and accounting are handled by blk-mq and therefore removed.

    * Queue suspension logic is replaced with the logic from the block
    layer.

    Contributions in this patch from:

    Sam Bradshaw
    Jens Axboe
    Keith Busch
    Robert Nelson

    Acked-by: Keith Busch
    Acked-by: Jens Axboe

    Updated for new ->queue_rq() prototype.

    Signed-off-by: Jens Axboe

    Matias Bjørling
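
To illustrate the per-command data point above: blk-mq allocates the driver's per-command area directly behind each request, so the lookup is plain pointer arithmetic. The userspace model below uses stand-in types. Only the "payload follows the request" layout mirrors blk_mq_rq_to_pdu(), and the struct contents are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the block layer's request. The pointer member keeps the
 * struct size a multiple of pointer alignment, so the payload placed
 * after it is suitably aligned. */
struct request {
    void *queue;
    int tag;
};

/* Hypothetical per-command driver data. */
struct nvme_cmd_info {
    void *ctx;
    int aborted;
};

/* Allocate a request with pdu_size bytes of driver data behind it. */
static struct request *alloc_request(size_t pdu_size, int tag)
{
    struct request *rq = calloc(1, sizeof(*rq) + pdu_size);

    if (rq)
        rq->tag = tag;
    return rq;
}

/* Same idea as blk_mq_rq_to_pdu(): the pdu starts right after the request. */
static void *rq_to_pdu(struct request *rq)
{
    return rq + 1;
}
```

Because the tag identifies the request and the pdu travels with it, the driver no longer needs its own cmdid allocator.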
     
  • Adds support for devices with max page size smaller than the host's.
    In the case we encounter such a host/device combination, the driver will
    split a page into as many PRP entries as necessary for the device's page
    size capabilities. If the device's reported minimum page size is greater
    than the host's, the driver will not attempt to enable the device and
    return an error instead.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Jens Axboe

    Keith Busch
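
A small sketch of the arithmetic implied above: how many device-page-sized PRP entries a transfer needs, and the check that rejects a device whose minimum page size exceeds the host's. Function names and the -1 error value are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of PRP entries needed to describe `len` bytes starting at
 * `addr` when the device works in pages of `dev_page` bytes. */
static uint32_t prp_entries(uint64_t addr, uint32_t len, uint32_t dev_page)
{
    uint64_t first, last;

    /* Device page sizes are powers of two. */
    assert(dev_page != 0 && (dev_page & (dev_page - 1)) == 0);
    if (len == 0)
        return 0;

    first = addr / dev_page;               /* first device page touched */
    last = (addr + len - 1) / dev_page;    /* last device page touched */
    return (uint32_t)(last - first + 1);
}

/* Returns 0 if the combination is usable, -1 if the device's minimum
 * page size is larger than the host page size. */
static int check_page_sizes(uint32_t host_page, uint32_t dev_min_page)
{
    return dev_min_page > host_page ? -1 : 0;
}
```

For example, one 64KB host page handed to a device limited to 4KB pages is split into 16 entries.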
     
  • Submits NVMe asynchronous event requests, one event up to the controller
    maximum or number of possible different event types (8), whichever is
    smaller. Events successfully returned by the controller are logged.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Jens Axboe

    Keith Busch
     

13 Jun, 2014

1 commit

  • There is a potential deadlock if a cpu event occurs during nvme probe
    since it registered with hot cpu notification. This fixes the race by
    having the module register with notification outside of probe rather
    than have each device register.

    The actual work is done in a scheduled work queue instead of in the
    notifier since assigning IO queues has the potential to block if the
    driver creates additional queues.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     

04 Jun, 2014

1 commit


05 May, 2014

3 commits

  • It is possible a filesystem may send a flush flagged bio with write
    data. There is no such composite NVMe command, so the driver sends flush
    and write separately.

    The device is allowed to execute these commands in any order, so it was
    possible the driver ends the bio after the write completes, but while the
    flush is still active. We don't want to let a filesystem believe flush
    succeeded before it really has; this could cause data corruption on a
    power loss between these events. To fix, this patch splits the flush
    and write into chained bios.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • This configures an nvme request_queue as flush capable if the device
    has a volatile write cache present.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • Make the copyright dates accurate and remove the final paragraph that
    includes the address of the FSF.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

12 Apr, 2014

1 commit

  • Pull NVMe driver updates from Matthew Wilcox:
    "Various updates to the NVMe driver. The most user-visible change is
    that drive hotplugging now works and CPU hotplug while an NVMe drive
    is installed should also work better"

    * git://git.infradead.org/users/willy/linux-nvme:
    NVMe: Retry failed commands with non-fatal errors
    NVMe: Add getgeo to block ops
    NVMe: Start-stop nvme_thread during device add-remove.
    NVMe: Make I/O timeout a module parameter
    NVMe: CPU hot plug notification
    NVMe: per-cpu io queues
    NVMe: Replace DEFINE_PCI_DEVICE_TABLE
    NVMe: Fix divide-by-zero in nvme_trans_io_get_num_cmds
    NVMe: IOCTL path RCU protect queue access
    NVMe: RCU protected access to io queues
    NVMe: Initialize device reference count earlier
    NVMe: Add CONFIG_PM_SLEEP to suspend/resume functions

    Linus Torvalds
     

11 Apr, 2014

4 commits

  • For commands returned with failed status, queue these for resubmission
    and continue retrying them until success or for a limited amount of
    time. The final timeout was arbitrarily chosen so requests can't be
    retried indefinitely.

    Since these are requeued on the nvmeq that submitted the command, the
    callbacks have to take an nvmeq instead of an nvme_dev as a parameter
    so that we can use the locked queue to append the iod to retry later.

    The nvme_iod conveniently can be used to track how long we've been trying
    to successfully complete an iod request. The nvme_iod also provides the
    nvme prp dma mappings, so I had to move a few things around so we can
    keep those mappings.

    Signed-off-by: Keith Busch
    [fixed checkpatch issue with long line]
    Signed-off-by: Matthew Wilcox

    Keith Busch
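
The retry policy described above might reduce to a check like the one below. The "do not retry" bit (0x4000 in the status field once the phase bit has been shifted out) comes from the NVMe completion format. The 30-second window, the struct, and the function name are hypothetical and chosen only for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_SC_DNR      0x4000u   /* "do not retry" bit in the status field */
#define RETRY_WINDOW_MS  30000u    /* hypothetical limit, for illustration */

/* Hypothetical stand-in for the timing information kept in the iod. */
struct iod_timer {
    uint64_t first_submit_ms;
};

/* Returns 1 if a failed command should be queued for resubmission. */
static int should_retry(const struct iod_timer *iod, uint16_t status,
                        uint64_t now_ms)
{
    if (status == 0)
        return 0;                      /* success, nothing to retry */
    if (status & NVME_SC_DNR)
        return 0;                      /* controller says retrying won't help */
    return now_ms - iod->first_submit_ms < RETRY_WINDOW_MS;
}
```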
     
  • Increase the default timeout to 30 seconds to match SCSI.

    Signed-off-by: Keith Busch
    [use byte instead of ushort]
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • Registers with hot cpu notification to rebalance, and potentially allocate
    additional, io queues.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • The device's IO queues are associated with CPUs, so we can use a per-cpu
    variable to map a qid to a cpu. This provides a convenient way
    to optimally assign queues to multiple cpus when the device supports
    fewer queues than the host has cpus. The previous implementation may
    have assigned these poorly in these situations. This patch addresses
    this by sharing queues among cpus that are "close" together and should
    have a lower lock contention penalty.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
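
One simple way to share queues among neighbouring CPUs is sketched below. It treats CPUs with adjacent ids as "close", which is an assumption made for the example. The driver's own assignment may weigh CPU topology differently, and the function name is hypothetical. IO queue ids start at 1 because qid 0 is the admin queue.

```c
#include <assert.h>

/* Map a cpu to an IO queue id in the range 1..nr_queues. When there are
 * fewer queues than cpus, contiguous groups of cpus share a queue. */
static unsigned cpu_to_qid(unsigned cpu, unsigned nr_cpus, unsigned nr_queues)
{
    assert(nr_cpus > 0 && nr_queues > 0 && cpu < nr_cpus);

    if (nr_queues > nr_cpus)
        nr_queues = nr_cpus;           /* extra queues go unused here */

    return 1 + cpu * nr_queues / nr_cpus;
}
```

With 8 CPUs and 4 queues, CPUs 0-1 share qid 1, CPUs 2-3 share qid 2, and so on.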
     

24 Mar, 2014

2 commits

  • This adds rcu protected access to a queue in the nvme IOCTL path
    to fix potential races between a surprise removal and queue usage in
    nvme_submit_sync_cmd. The fix holds the rcu_read_lock() here to prevent
    the nvme_queue from freeing while this path is executing so it can't
    sleep, and so this path will no longer wait for an available command
    id should they all be in use at the time a passthrough IOCTL request
    is received.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • This adds rcu protected access to nvme_queue to fix a race between a
    surprise removal freeing the queue and a thread with open reference on
    an NVMe block device using that queue.

    The queues do not need to be rcu protected during the initialization or
    shutdown parts, so I've added a helper function for raw dereferencing
    to get around the sparse errors.

    There is still a hole in the IOCTL path for the same problem, which is
    fixed in a subsequent patch.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     

07 Mar, 2014

1 commit

  • PREPARE_[DELAYED_]WORK() are being phased out. They have few users
    and a nasty surprise in terms of reentrancy guarantee as workqueue
    considers work items to be different if they don't have the same work
    function.

    nvme_dev->reset_work is multiplexed with multiple work functions.
    Introduce nvme_reset_workfn() which invokes nvme_dev->reset_workfn and
    always use it as the work function and update the users to set the
    ->reset_workfn field instead of overriding the work function using
    PREPARE_WORK().

    It would probably be best to route this with other related updates
    through the workqueue tree.

    Compile tested.

    Signed-off-by: Tejun Heo
    Cc: Matthew Wilcox
    Cc: linux-nvme@lists.infradead.org

    Tejun Heo
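
The indirection described above can be modelled in userspace as follows. nvme_reset_workfn and the reset_workfn field are named in the commit message. The simplified work_struct, the two example handlers, and run_work() are hypothetical stand-ins for the workqueue machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's work item. */
struct work_struct {
    void (*func)(struct work_struct *);
};

struct nvme_dev {
    struct work_struct reset_work;
    void (*reset_workfn)(struct work_struct *);  /* the multiplexed target */
    int last_action;                             /* for the example only */
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The one work function the workqueue ever sees for reset_work. */
static void nvme_reset_workfn(struct work_struct *work)
{
    struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);

    dev->reset_workfn(work);
}

/* Two hypothetical handlers that used to be swapped in via PREPARE_WORK(). */
static void reset_failed_dev(struct work_struct *work)
{
    container_of(work, struct nvme_dev, reset_work)->last_action = 1;
}

static void remove_dead_dev(struct work_struct *work)
{
    container_of(work, struct nvme_dev, reset_work)->last_action = 2;
}

/* Stand-in for the workqueue executing a queued item. */
static void run_work(struct work_struct *work)
{
    work->func(work);
}
```

Because the work item's function never changes, the workqueue always treats it as the same item, and callers only swap the ->reset_workfn pointer.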
     

28 Jan, 2014

2 commits

  • Send nvme abort command to io requests that have timed out on an
    initialized device. If the command is not returned after another timeout,
    schedule the controller for reset.

    Signed-off-by: Keith Busch
    [fix endianness issues]
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • Schedules a controller reset when it indicates it has a failed status. If
    the device does not become ready after a reset, the pci device will be
    scheduled for removal.

    Signed-off-by: Keith Busch
    [fixed checkpatch issue]
    Signed-off-by: Matthew Wilcox

    Keith Busch
     

17 Dec, 2013

2 commits

  • Adds controller error handling on resume power management. If the device
    fails to initialize, the device is queued for a reset. If the reset fails,
    a thread is spawned to remove the pci device.

    If the device resumes as "busy", the device is responding to admin
    commands but will not create IO queues. In this case, we need to remove
    the gendisks and free the IO queues since they can't be used and may be
    holding bios in their lists.

    From testing, the dma pools require a pci device so this had to change
    the pci driver 'remove' to release the dma resources in line with that
    call instead of after all references to the device are released.

    Signed-off-by: Keith Busch
    Signed-off-by: Matthew Wilcox

    Keith Busch
     
  • For 32-bit versions of sg3-utils running on a 64-bit system. This is
    mostly a copy from the relevant portions of fs/compat_ioctl.c, with
    slight modifications for going through block_device_operations.

    Signed-off-by: Keith Busch
    Reviewed-by: Vishal Verma
    [fixed up CONFIG_COMPAT=n build problems]
    Signed-off-by: Matthew Wilcox

    Keith Busch