20 Jan, 2017

1 commit

  • commit b5a10c5f7532b7473776da87e67f8301bbc32693 upstream.

    Commit 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter
    readiness") introduced a quirk for adapters that cannot read the bit
    NVME_CSTS_RDY right after register NVME_REG_CC is set; these adapters
    need a delay, or else reading the bit NVME_CSTS_RDY could corrupt the
    adapter's register state, from which it never recovers.

    When this quirk was added, we checked ctrl->tagset in order to avoid
    quirking in probe time, supposing we would never require such delay
    during probe. Well, it was too optimistic; we in fact need this quirk
    at probe time in some cases, like after a kexec.

    In some experiments, after abnormal shutdown of machine (aka power cord
    unplug), we booted into our bootloader in Power, which is a Linux kernel,
    and kexec'ed into another distro. If this kexec is too quick, we end up
    reaching the probe of NVMe adapter in that distro when adapter is in
    bad state (not fully initialized on our bootloader). What happens next
    is that nvme_wait_ready() is unable to complete, except if the quirk is
    enabled.

    So, this patch removes the original ctrl->tagset verification in order
    to enable the quirk even on probe time.

    Fixes: 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter readiness")
    Reported-by: Andrew Byrne
    Reported-by: Jaime A. H. Gomez
    Reported-by: Zachary D. Myers
    Signed-off-by: Guilherme G. Piccoli
    Acked-by: Jeffrey Lien
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Guilherme G. Piccoli
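
The change described in this commit amounts to dropping one condition. A minimal sketch in C, assuming a hypothetical `struct ctrl` and flag name that stand in for the driver's real types (only the `tagset` check and the existence of a delay quirk come from the commit message):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; not the kernel definitions. */
#define QUIRK_DELAY_BEFORE_CHK_RDY (1u << 0)

struct ctrl {
    unsigned int quirks;
    void *tagset;   /* assumed NULL during probe, set once I/O queues exist */
};

/* Before the fix: the delay was skipped whenever tagset was still NULL. */
static bool needs_delay_old(const struct ctrl *c)
{
    return (c->quirks & QUIRK_DELAY_BEFORE_CHK_RDY) && c->tagset != NULL;
}

/* After the fix: the quirk alone decides, so probe time is covered too. */
static bool needs_delay_new(const struct ctrl *c)
{
    return (c->quirks & QUIRK_DELAY_BEFORE_CHK_RDY) != 0;
}
```

For a quirky adapter that is still probing (no tagset yet), the old predicate is false and the new one is true, which is the kexec case the message describes.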
     

06 Jan, 2017

1 commit

  • commit e4fcf07cca6a3b6c4be00df16f08be894325eaa3 upstream.

    When removing a namespace we delete it from the subsystem namespaces
    list with list_del_init which allows us to know if it is enabled or
    not.

    The problem is that list_del_init reinitializes the entry's next pointer
    and does not respect the RCU list traversal we do on the IO path for
    locating a namespace. Instead we need to use list_del_rcu, which is
    allowed to run concurrently with the _rcu list-traversal primitives (it
    keeps the next pointer intact) and guarantees that a concurrent
    nvmet_find_namespace makes forward progress.

    With that change, we can no longer rely on ns->dev_link to know whether
    the namespace is enabled, so add an enabled indicator entry to nvmet_ns
    for that.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Solganik Alexander
    Signed-off-by: Greg Kroah-Hartman

    Solganik Alexander
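
The difference between the two removal styles can be seen with a toy list. This is a rough single-threaded sketch with a hand-written list modelled loosely on the kernel's `struct list_head`; the helper names are made up for illustration and there is no real RCU or memory ordering here:

```c
#include <stddef.h>

/* Minimal circular doubly linked list; illustration only. */
struct node {
    struct node *next, *prev;
};

static void list_init(struct node *n)
{
    n->next = n;
    n->prev = n;
}

static void list_add_tail(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* In the style of list_del_init(): unlink, then point the entry at itself.
 * A reader currently standing on n would follow n->next back to n forever. */
static void del_init(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

/* In the style of list_del_rcu(): unlink but leave n->next intact, so a
 * reader standing on n can still walk forward to the rest of the list.
 * (The kernel poisons prev; NULL is used here for simplicity.) */
static void del_keep_next(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = NULL;
}
```

After `del_init(&a)` the removed entry's `next` points at itself, while after `del_keep_next(&a)` it still points at the following entry.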
     

17 Nov, 2016

1 commit

  • The nvme_remove function tears down all allocated resources in the correct
    order, so no need to free queues on error during initialization. This
    fixes possible use-after-free errors when queues are still associated
    with a blk-mq hctx.

    Reported-by: Scott Bauer
    Tested-by: Scott Bauer
    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
     

14 Nov, 2016

6 commits

  • Draining the qp right after disconnect might not suffice because
    the nvmet sq is not fully drained (in nvmet_sq_destroy) and we might
    see completions after the drain. Instead, drain right before the
    qp destroy which comes after the sq destruction and we can be sure
    that no posts come after the drain.

    Tested-by: Steve Wise
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • While testing nvme-rdma with the spdk nvmf target over iw_cxgb4, I
    configured the target (mistakenly) to generate an error creating the
    NVMF IO queues. This resulted in an "Invalid SQE Parameter" error sent back
    to the host on the first IO queue connect:

    [ 9610.928182] nvme nvme1: queue_size 128 > ctrl maxcmd 120, clamping down
    [ 9610.938745] nvme nvme1: creating 32 I/O queues.

    So nvmf_connect_io_queue() returns an error to
    nvmf_connect_io_queue() / nvmf_connect_io_queues(), and that
    is returned to nvme_rdma_create_io_queues(). In the error path,
    nvmf_rdma_create_io_queues() frees the queue tagset memory _before_
    stopping and freeing the IB queues, which causes yet another
    touch-after-free crash due to SQ CQEs being flushed after the ib_cqe
    structs pointed-to by the flushed WRs have been freed (since they are
    part of the nvme_rdma_request struct).

    The fix is to stop and free the queues in nvmf_connect_io_queues()
    if there is an error connecting any of the queues.

    Signed-off-by: Steve Wise
    Signed-off-by: Sagi Grimberg

    Steve Wise
     
  • In case we accepted a queue connection and it failed, we might not
    remove the queue from the list until we unload and clean it up.
    We should delete it from the queue list on the relevant handler.

    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • In the transport, in case of an internal queue error like an
    error completion in rdma, we trigger a fatal error. However,
    multiple queues in the same controller can see error completions
    and we don't want to trigger fatal error work more than once.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
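
One way to make sure the fatal error work is triggered only once is a test-and-set guard. The sketch below uses a C11 atomic flag as a stand-in technique; the struct, field and function names are hypothetical, and the actual patch may guard the work differently (for example with a lock and a status bit):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical controller state for illustration. */
struct ctrl {
    atomic_bool fatal_pending;
    int fatal_work_runs;   /* counts how often the work was scheduled */
};

/* Returns true only for the first caller; later callers are ignored. */
static bool ctrl_fatal_error(struct ctrl *c)
{
    bool expected = false;

    /* Only one caller can flip the flag from false to true. */
    if (!atomic_compare_exchange_strong(&c->fatal_pending, &expected, true))
        return false;

    c->fatal_work_runs++;   /* stand-in for scheduling the work item */
    return true;
}
```

If several queues of the same controller report an error completion, only the first call schedules the work.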
     
  • If we reconnect we might have commands queued up that get resent as soon
    as the queue is restarted. But until the connect command has succeeded we
    can't send other commands. Add a new flag that marks a queue as live when
    connect finishes, and delay any non-connect command until the queue is
    live based on it.

    Signed-off-by: Christoph Hellwig
    Reported-by: Steve Wise
    Tested-by: Steve Wise
    [sagig: fixes admin queue LIVE setting]
    Signed-off-by: Sagi Grimberg

    Christoph Hellwig
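
The gating rule in this commit can be sketched as follows. All names here (`struct queue`, `queue_submit`, the enums) are hypothetical and only model the behaviour described in the message, not the driver's code:

```c
#include <stdbool.h>

enum cmd_kind { CMD_CONNECT, CMD_OTHER };
enum submit_result { SUBMIT_OK, SUBMIT_BUSY };

/* Hypothetical queue with the "live" flag described above. */
struct queue {
    bool live;
};

/* Until the queue is live, only the connect command may pass;
 * everything else is reported busy so the caller can retry later. */
static enum submit_result queue_submit(const struct queue *q, enum cmd_kind kind)
{
    if (!q->live && kind != CMD_CONNECT)
        return SUBMIT_BUSY;
    return SUBMIT_OK;
}

/* Called when the connect command has completed successfully. */
static void connect_done(struct queue *q)
{
    q->live = true;
}
```

After a reconnect the flag would be cleared again, so queued commands wait until the new connect has finished.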
     
  • When we initiate queue teardown sequence we call rdma_destroy_qp
    which clears cm_id->qp, afterwards we call rdma_destroy_id, but
    we might see a rdma_cm event in between with a cleared cm_id->qp
    so watch out for that and silently ignore the event because this
    means that the queue teardown sequence is in progress.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Sagi Grimberg

    Bart Van Assche
     

12 Nov, 2016

1 commit

  • The ns->lba_shift value is assumed to be the base-2 logarithm of the
    LBA size. A previous patch duplicated the lba_shift calculation into
    lightnvm. It prematurely also subtracted a 512byte shift, which commonly
    is applied per-command. The 512byte shift being subtracted twice led to
    data loss when restoring the logical to physical mapping table from
    device and when issuing I/O commands using rrpc.

    Fix offset by removing the 512byte shift subtraction when calculating
    lba_shift.

    Fixes: b0b4e09c1ae7 "lightnvm: control life of nvm_dev in driver"
    Reported-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
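
A small worked example of the shift arithmetic, assuming a 4096-byte logical block (so `lba_shift` is 12). The helper names are made up for illustration; the point is only that the 9-bit (512-byte) adjustment belongs in the per-command conversion and must not also be baked into the stored shift:

```c
#include <stdint.h>

/* Size in bytes of one logical block, given log2 of that size. */
static uint32_t lba_size(unsigned int lba_shift)
{
    return UINT32_C(1) << lba_shift;
}

/* Convert a 512-byte sector number to a device LBA. The 512-byte shift
 * is applied here, once per command. Requires lba_shift >= 9. */
static uint64_t sector_to_lba(uint64_t sector, unsigned int lba_shift)
{
    return sector >> (lba_shift - 9);
}
```

With `lba_shift = 12`, `lba_size(12)` is 4096 and sector 80 maps to LBA 10. If 9 had already been subtracted when storing the shift, the block size would come out as `lba_size(3)`, which is 8 bytes, and every offset derived from it would be wrong.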
     

22 Oct, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "A set of fixes that missed the merge window, mostly due to me being
    away around that time.

    Nothing major here, a mix of nvme cleanups and fixes, and one fix for
    the badblocks handling"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvmet: use symbolic constants for CNS values
    nvme: use symbolic constants for CNS values
    nvme.h: add an enum for cns values
    nvme.h: don't use uuid_be
    nvme.h: resync with nvme-cli
    nvme: Add tertiary number to NVME_VS
    nvme : Add sysfs entry for NVMe CMBs when appropriate
    nvme: don't schedule multiple resets
    nvme: Delete created IO queues on reset
    nvme: Stop probing a removed device
    badblocks: fix overlapping check for clearing

    Linus Torvalds
     

20 Oct, 2016

4 commits


13 Oct, 2016

1 commit

  • Add a sysfs attribute that contains salient information about the NVMe
    Controller Memory Buffer when one is present. For now, just display the
    information about the CMB available from the control registers. We attach
    the CMB attribute file to the existing nvme_ctrl sysfs group so it can
    handle the sysfs teardown.

    Reviewed-by: Sagi Grimberg
    Reviewed-by: Jay Freyensee
    Signed-off-by: Stephen Bates
    Acked-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Stephen Bates
     

12 Oct, 2016

4 commits

  • The queue_work only fails if the work is pending, but not yet running. If
    the work is running, the work item would get requeued, triggering a
    double reset. If the first reset fails for any reason, the second
    reset triggers:

    WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

    Hitting that schedules controller deletion for a second time, which
    potentially takes a reference on the device that is being deleted.
    If the reset occurs at the same time as a hot removal event, this causes
    a double-free.

    This patch has the reset helper function check if the work is busy
    prior to queueing, and changes all places that schedule resets to use
    this function. Since most users don't want to sync with that work, the
    "flush_work" is moved to the only caller that wants to sync.

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
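
The double-reset window can be modelled with a toy work item. This is a single-threaded simulation with invented names (`queue_work_sim`, `work_busy_sim`, `reset_schedule`); it mimics only the behaviour the message describes and is not the kernel workqueue API:

```c
#include <stdbool.h>

/* Toy work item: pending = queued but not started, running = executing. */
struct work {
    bool pending;
    bool running;
};

/* Mimics the behaviour described above: queueing is refused only when
 * the item is already pending. A running item gets queued again. */
static bool queue_work_sim(struct work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

/* Busy means either queued or currently executing. */
static bool work_busy_sim(const struct work *w)
{
    return w->pending || w->running;
}

/* Reset helper in the spirit of the patch: check for busy before queueing. */
static bool reset_schedule(struct work *w)
{
    if (work_busy_sim(w))
        return false;
    return queue_work_sim(w);
}
```

While a reset is running, `reset_schedule()` refuses a second one, whereas a bare `queue_work_sim()` would accept it and set up the double reset.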
     
  • The driver was decrementing the online_queues prior to attempting to
    delete those IO queues, so the driver ended up not requesting the
    controller delete any. This patch saves the online_queues prior to
    suspending them, and adds that parameter for deleting io queues.

    Fixes: c21377f8 ("nvme: Suspend all queues before deletion")
    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
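
The ordering bug is easy to reproduce in miniature. In this sketch `online_queues` counts only I/O queues and all function names are hypothetical; it assumes, as the message implies, that suspending the queues is what drives the counter down:

```c
/* Hypothetical device state for illustration. */
struct dev {
    int online_queues;
};

/* Suspending takes every queue offline, so the counter drops to zero. */
static void suspend_queues(struct dev *d)
{
    d->online_queues = 0;
}

/* Returns how many delete requests would be sent to the controller. */
static int delete_io_queues(int queues)
{
    return queues;
}

/* Buggy order: the counter is read after it has been decremented. */
static int teardown_buggy(struct dev *d)
{
    suspend_queues(d);
    return delete_io_queues(d->online_queues);
}

/* Fixed order: save the count first, then pass it in as a parameter. */
static int teardown_fixed(struct dev *d)
{
    int queues = d->online_queues;

    suspend_queues(d);
    return delete_io_queues(queues);
}
```

With four queues online, the buggy order requests zero deletions and the fixed order requests four.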
     
  • There is no reason the nvme controller can ever return all 1's from
    reading the CSTS register. This patch returns an error if we observe
    that status. Without this, we may incorrectly proceed with controller
    initialization and unnecessarily rely on error handling to clean this up.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
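
A check of this kind might look like the sketch below. The function name is hypothetical and the choice of `-ENODEV` as the error code is an assumption for illustration; the message itself only says that an error is returned:

```c
#include <errno.h>
#include <stdint.h>

/* Reject a CSTS value of all 1's. A read that returns all ones usually
 * means the device is not responding, so it is not a valid status. */
static int check_csts(uint32_t csts)
{
    if (csts == UINT32_MAX)
        return -ENODEV;   /* assumed error code */
    return 0;
}
```

Any other value is passed through for the normal readiness checks.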
     
  • Use the DMA_ATTR_NO_WARN attribute for the dma_map_sg() call of the nvme
    driver that returns BLK_MQ_RQ_QUEUE_BUSY (not for BLK_MQ_RQ_QUEUE_ERROR).

    Link: http://lkml.kernel.org/r/1470092390-25451-4-git-send-email-mauricfo@linux.vnet.ibm.com
    Signed-off-by: Mauricio Faria de Oliveira
    Reviewed-by: Gabriel Krisman Bertazi
    Cc: Keith Busch
    Cc: Jens Axboe
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mauricio Faria de Oliveira
     

10 Oct, 2016

2 commits

  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     
  • Pull main rdma updates from Doug Ledford:
    "This is the main pull request for the rdma stack this release. The
    code has been through 0day and I had it tagged for linux-next testing
    for a couple days.

    Summary:

    - updates to mlx5

    - updates to mlx4 (two conflicts, both minor and easily resolved)

    - updates to iw_cxgb4 (one conflict, not so obvious to resolve,
    proper resolution is to keep the code in cxgb4_main.c as it is in
    Linus' tree as attach_uld was refactored and moved into
    cxgb4_uld.c)

    - improvements to uAPI (moved vendor specific API elements to uAPI
    area)

    - add hns-roce driver and hns and hns-roce ACPI reset support

    - conversion of all rdma code away from deprecated
    create_singlethread_workqueue

    - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
    staging)"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
    staging/lustre: Disable InfiniBand support
    iw_cxgb4: add fast-path for small REG_MR operations
    cxgb4: advertise support for FR_NSMR_TPTE_WR
    IB/core: correctly handle rdma_rw_init_mrs() failure
    IB/srp: Fix infinite loop when FMR sg[0].offset != 0
    IB/srp: Remove an unused argument
    IB/core: Improve ib_map_mr_sg() documentation
    IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
    IB/mthca: Move user vendor structures
    IB/nes: Move user vendor structures
    IB/ocrdma: Move user vendor structures
    IB/mlx4: Move user vendor structures
    IB/cxgb4: Move user vendor structures
    IB/cxgb3: Move user vendor structures
    IB/mlx5: Move and decouple user vendor structures
    IB/{core,hw}: Add constant for node_desc
    ipoib: Make ipoib_warn ratelimited
    IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
    IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
    IB/ipoib: Remove deprecated create_singlethread_workqueue
    ...

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block layer changes in 4.9.

    As mentioned at the last merge window, I've changed things up and now
    do just one branch for core block layer changes, and driver changes.
    This avoids dependencies between the two branches. Outside of this
    main pull request, there are two topical branches coming as well.

    This pull request contains:

    - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

    - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
    Followup dependency fix from Geert.

    - General fixes from Bart, Baoyou, Guoqing, and Linus W.

    - CFQ async write starvation fix from Glauber.

    - Add support for delayed kick of the requeue list, from Mike.

    - Pull out the scalable bitmap code from blk-mq-tag.c and make it
    generally available under the name of sbitmap. Only blk-mq-tag uses
    it for now, but the blk-mq scheduling bits will use it as well.
    From Omar.

    - bdev thaw error propagation from Pierre.

    - Improve the blk polling statistics, and allow the user to clear
    them. From Stephen.

    - Set of minor cleanups from Christoph in block/blk-mq.

    - Set of cleanups and optimizations from me for block/blk-mq.

    - Various nvme/nvmet/nvmeof fixes from the various folks"

    * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
    fs/block_dev.c: return the right error in thaw_bdev()
    nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
    nvme/scsi: Remove power management support
    nvmet: Make dsm number of ranges zero based
    nvmet: Use direct IO for writes
    admin-cmd: Added smart-log command support.
    nvme-fabrics: Add host_traddr options field to host infrastructure
    nvme-fabrics: revise host transport option descriptions
    nvme-fabrics: rework nvmf_get_address() for variable options
    nbd: use BLK_MQ_F_BLOCKING
    blkcg: Annotate blkg_hint correctly
    cfq: fix starvation of asynchronous writes
    blk-mq: add flag for drivers wanting blocking ->queue_rq()
    blk-mq: remove non-blocking pass in blk_mq_map_request
    blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
    block: export bio_free_pages to other modules
    lightnvm: propagate device_add() error code
    lightnvm: expose device geometry through sysfs
    lightnvm: control life of nvm_dev in driver
    blk-mq: register device instead of disk
    ...

    Linus Torvalds
     

25 Sep, 2016

2 commits

  • Any user I can imagine that needs a buffer at all will want to pass
    a pointer directly. There are currently no callers that use
    buffers, so this change is painless, and it will make it much easier
    to start using features that use buffers (e.g. APST).

    Signed-off-by: Andy Lutomirski
    Reviewed-by: Christoph Hellwig
    Acked-by: Jay Freyensee
    Tested-by: Jay Freyensee
    Signed-off-by: Jens Axboe

    Andy Lutomirski
     
  • As far as I can tell, there is basically nothing correct about this
    code. It misinterprets npss (off-by-one). It hardcodes a bunch of
    power states, which is nonsense, because they're all just indices
    into a table that software needs to parse. It completely ignores
    the distinction between operational and non-operational states.
    And, until 4.8, if all of the above magically succeeded, it would
    dereference a NULL pointer and OOPS.

    Since this code appears to be useless, just delete it.

    Signed-off-by: Andy Lutomirski
    Reviewed-by: Christoph Hellwig
    Acked-by: Jay Freyensee
    Tested-by: Jay Freyensee
    Signed-off-by: Jens Axboe

    Andy Lutomirski
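
On the off-by-one mentioned in this commit: the NVMe specification reports NPSS as a zero-based value, so the number of power states is one more than the field. A tiny sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/* NPSS is zero-based: a value of 0 means one power state is supported. */
static unsigned int num_power_states(uint8_t npss)
{
    return (unsigned int)npss + 1;
}
```

Valid power state indices therefore run from 0 to `npss` inclusive, and each index refers to an entry in the controller's power state descriptor table rather than to a fixed meaning.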
     

24 Sep, 2016

8 commits


23 Sep, 2016

1 commit


21 Sep, 2016

3 commits

  • For a host to access an Open-Channel SSD, it has to know its geometry,
    so that it writes and reads at the appropriate device bounds.

    Currently, the geometry information is kept within the kernel, and not
    exported to user-space for consumption. This patch exposes the
    configuration through sysfs and enables user-space libraries, such as
    liblightnvm, to use the sysfs implementation to get the geometry of an
    Open-Channel SSD.

    The sysfs entries are stored within the device hierarchy, and can be
    found using the "lightnvm" device type.

    An example configuration looks like this:

    /sys/class/nvme/
    └── nvme0n1
        ├── capabilities: 3
        ├── device_mode: 1
        ├── erase_max: 1000000
        ├── erase_typ: 1000000
        ├── flash_media_type: 0
        ├── media_capabilities: 0x00000001
        ├── media_type: 0
        ├── multiplane: 0x00010101
        ├── num_blocks: 1022
        ├── num_channels: 1
        ├── num_luns: 4
        ├── num_pages: 64
        ├── num_planes: 1
        ├── page_size: 4096
        ├── prog_max: 100000
        ├── prog_typ: 100000
        ├── read_max: 10000
        ├── read_typ: 10000
        ├── sector_oob_size: 0
        ├── sector_size: 4096
        ├── media_manager: gennvm
        ├── ppa_format: 0x380830082808001010102008
        ├── vendor_opcode: 0
        ├── max_phys_secs: 64
        └── version: 1

    Signed-off-by: Simon A. F. Lund
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Simon A. F. Lund
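
As an illustration of what a user-space consumer could do with these values, the sketch below multiplies the geometry fields into a raw capacity. The struct and function are hypothetical, and the way the fields compose (LUNs counted per channel, blocks per plane, pages per block) is an assumption made for this example, not something the commit message states:

```c
#include <stdint.h>

/* Hypothetical container for values read from the sysfs attributes. */
struct geometry {
    uint64_t num_channels;
    uint64_t num_luns;     /* assumed to be per channel */
    uint64_t num_planes;
    uint64_t num_blocks;   /* assumed to be per plane */
    uint64_t num_pages;    /* assumed to be per block */
    uint64_t page_size;    /* bytes */
};

/* Raw capacity in bytes under the assumptions noted above. */
static uint64_t raw_capacity(const struct geometry *g)
{
    return g->num_channels * g->num_luns * g->num_planes *
           g->num_blocks * g->num_pages * g->page_size;
}
```

With the example values (1 channel, 4 LUNs, 1 plane, 1022 blocks, 64 pages, 4096-byte pages) this gives 1 × 4 × 1 × 1022 × 64 × 4096 = 1,071,644,672 bytes.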
     
  • LightNVM compatible device drivers do not have a method to expose
    LightNVM specific sysfs entries.

    To enable LightNVM sysfs entries to be exposed, lightnvm device
    drivers require a struct device to attach it to. To allow both the
    actual device driver and lightnvm sysfs entries to coexist, the device
    driver tracks the lifetime of the nvm_dev structure.

    This patch refactors NVMe and null_blk to handle the lifetime of struct
    nvm_dev, which eliminates the need for struct gendisk when a lightnvm
    compatible device is provided.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • With LightNVM enabled namespaces, the gendisk structure is not exposed
    to the user. This prevents LightNVM users from accessing the NVMe device
    driver specific sysfs entries, and LightNVM namespace geometry.

    Refactor the revalidation process, so that a namespace, instead of a
    gendisk, is revalidated. This allows later patches to wire the
    sysfs entries up to a non-gendisk namespace.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     

16 Sep, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "A set of fixes for the current series in the realm of block.

    Like the previous pull request, the meat of it are fixes for the nvme
    fabrics/target code. Outside of that, just one fix from Gabriel for
    not doing a queue suspend if we didn't get the admin queue setup in
    the first place"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvme-rdma: add back dependency on CONFIG_BLOCK
    nvme-rdma: fix null pointer dereference on req->mr
    nvme-rdma: use ib_client API to detect device removal
    nvme-rdma: add DELETING queue flag
    nvme/quirk: Add a delay before checking device ready for memblaze device
    nvme: Don't suspend admin queue that wasn't created
    nvme-rdma: destroy nvme queue rdma resources on connect failure
    nvme_rdma: keep a ref on the ctrl during delete/flush
    iw_cxgb4: block module unload until all ep resources are released
    iw_cxgb4: call dev_put() on l2t allocation failure

    Linus Torvalds
     

15 Sep, 2016

2 commits