18 May, 2016

9 commits

  • While doing recent bring-up of nvme/host with target-core T10-PI,
    I noticed /sys/block/nvme*/integrity/device_is_integrity_capable
    was false, and /sys/block/nvme*/integrity/tag_size contained
    a bogus value.

    AFAICT outside of blk_integrity_compare() for DM + MD these
    are informational values, but go ahead and add the missing
    assignments for nvme/host to match what SCSI does within
    sd_dif_config_host() for consistency's sake.

    Cc: Keith Busch
    Cc: Jay Freyensee
    Cc: Martin K. Petersen
    Cc: Sagi Grimberg
    Cc: Christoph Hellwig
    Signed-off-by: Nicholas Bellinger
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Nicholas Bellinger
     
  • Adds two Intel controllers that have the "stripe" quirk.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • This patch adds a new state that, when set, has the core automatically
    kill request queues prior to removing namespaces.

    If the PCI device is not present at the time the nvme driver's remove is
    called, we can kill all IO queues immediately instead of waiting for
    the watchdog thread to do so at its polling interval. This improves
    scenarios where multiple hot plug events occur at the same time, since
    it doesn't block PCI enumeration for as long.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • This exposes ioctl and sysfs methods a user can invoke to request the
    driver rescan a controller and its namespaces. This is less harsh than
    doing a controller reset, which temporarily halts all IO, just to
    surface a newly attached namespace.

    This is mainly useful for controllers that implement the namespace
    management command, but do not support the namespace notify change
    asynchronous event notification.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Keith Busch
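
    A minimal sketch of how the sysfs trigger described above is typically
    wired up in the core (the attribute name and the nvme_queue_scan() call
    are assumptions for illustration, not the exact driver code):

      static ssize_t nvme_sysfs_rescan(struct device *dev,
                                       struct device_attribute *attr,
                                       const char *buf, size_t count)
      {
              struct nvme_ctrl *ctrl = dev_get_drvdata(dev);

              nvme_queue_scan(ctrl);  /* schedule namespace (re)validation */
              return count;
      }
      static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);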
     
  • Reduce error logging when no corrective action is required.

    Suggested-by: Chris Petersen
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Instead of removing the PCI device from the kernel's topology on
    controller failure, this patch simply requests unbinding the device
    from the driver. This avoids concurrently running pci removal with the
    hot plug event, which has been reported to be problematic when multiple
    surprise events occur near simultaneously.

    The other benefit is that we will have PCI config and memory space
    available to poke around for debugging a failed controller, assuming
    the device was not physically removed.

    The down side occurs if the platform and/or kernel do not support any
    type of surprise hot removal. The device will remain visible through
    sysfs (and therefore lspci), and some manual work is necessary to get
    the logical topology corrected. But if your platform and/or kernel don't
    support surprise removal, you probably shouldn't be doing that anyway.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Use the online queue count instead of the number of allocated queues. The
    controller should just return an invalid queue identifier error to the
    commands if a queue wasn't created. While it's not harmful, it's still
    not correct.

    Reported-by: Saar Gross
    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • The driver previously requested allocating queues for the total possible
    number of CPUs so that blk-mq could rebalance these if CPUs were added
    after initialization. The number of hardware contexts can now be changed
    at runtime, so we only need to allocate the number of online queues
    since we can add more later.

    Suggested-by: Jeff Lien
    Signed-off-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Keith Busch
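
    The core of the change described above is simply which CPU count is used
    when asking for queues; a sketch of the before/after:

      -       nr_io_queues = num_possible_cpus();
      +       nr_io_queues = num_online_cpus();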
     
  • Pull block driver updates from Jens Axboe:
    "On top of the core pull request, this is the drivers pull request for
    this merge window. This contains:

    - Switch drivers to the new write back cache API, and kill off the
    flush flags. From me.

    - Kill the discard support for the STEC pci-e flash driver. It's
    trivially broken, and apparently unmaintained, so it's safer to
    just remove it. From Jeff Moyer.

    - A set of lightnvm updates from the usual suspects (Matias/Javier,
    and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei
    Tao.

    - A set of updates for NVMe:

    - Turn the controller state management into a proper state
    machine. From Christoph.

    - Shuffling of code in preparation for NVMe-over-fabrics, also
    from Christoph.

    - Cleanup of the command prep part from Ming Lin.

    - Rewrite of the discard support from Ming Lin.

    - Deadlock fix for namespace removal from Ming Lin.

    - Use the now exported blk-mq tag helper for IO termination.
    From Sagi.

    - Various little fixes from Christoph, Guilherme, Keith, Ming
    Lin, Wang Sheng-Hui.

    - Convert mtip32xx to use the now exported blk-mq tag iter function,
    from Keith"

    * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits)
    lightnvm: reserved space calculation incorrect
    lightnvm: rename nr_pages to nr_ppas on nvm_rq
    lightnvm: add is_cached entry to struct ppa_addr
    lightnvm: expose gennvm_mark_blk to targets
    lightnvm: remove mgt targets on mgt removal
    lightnvm: pass dma address to hardware rather than pointer
    lightnvm: do not assume sequential lun alloc.
    nvme/lightnvm: Log using the ctrl named device
    lightnvm: rename dma helper functions
    lightnvm: enable metadata to be sent to device
    lightnvm: do not free unused metadata on rrpc
    lightnvm: fix out of bound ppa lun id on bb tbl
    lightnvm: refactor set_bb_tbl for accepting ppa list
    lightnvm: move responsibility for bad blk mgmt to target
    lightnvm: make nvm_set_rqd_ppalist() aware of vblks
    lightnvm: remove struct factory_blks
    lightnvm: refactor device ops->get_bb_tbl()
    lightnvm: introduce nvm_for_each_lun_ppa() macro
    lightnvm: refactor dev->online_target to global nvm_targets
    lightnvm: rename nvm_targets to nvm_tgt_type
    ...

    Linus Torvalds
     

07 May, 2016

8 commits

  • The number of ppas contained in a request is not necessarily the number
    of pages that it maps to, either on the target or on the device side.
    To avoid confusion, rename nr_pages to nr_ppas, since that is what the
    variable actually contains.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • A recent change to lightnvm added code to pass a kernel pointer
    to the hardware, which gcc complained about:

    drivers/nvme/host/lightnvm.c: In function 'nvme_nvm_rqtocmd':
    drivers/nvme/host/lightnvm.c:472:32: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
    c->ph_rw.metadata = cpu_to_le64(rqd->meta_list);

    It looks like this has no way of working anyway, so this changes
    the code to pass the dma_address instead. This was most likely
    what was intended here. Neither of the two are currently ever
    written to, so the effect is the same for now.

    Signed-off-by: Arnd Bergmann
    Fixes: a34b1eb78e21 ("lightnvm: enable metadata to be sent to device")
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Arnd Bergmann
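
    The fix above boils down to handing the device the DMA handle instead of
    the kernel virtual address (the dma_meta_list field name is assumed from
    struct nvm_rq; treat this as a sketch rather than the exact diff):

      -       c->ph_rw.metadata = cpu_to_le64(rqd->meta_list);
      +       c->ph_rw.metadata = cpu_to_le64(rqd->dma_meta_list);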
     
  • Align with the rest of the nvme subsystem.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • Until now, the dma pools have been used exclusively to allocate the ppa
    list being sent to the device. In pblk (upcoming), we use these pools to
    allocate metadata too. Thus, we generalize the names of some variables
    in the dma helper functions to make the code more readable.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Enable a metadata buffer to be sent to the device through the metadata
    field of the physical rw nvme command. The size of the metadata buffer
    must be dev->oob_size * the number of PPAs.

    Signed-off-by: Javier González
    Updated description.
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The set_bb_tbl takes struct nvm_rq and only uses its ppa_list and
    nr_pages internally. Instead, make these two variables explicit.
    This allows a user to call it without initializing a struct nvm_rq
    first.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
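
    The resulting interface looks roughly like this (the prototypes are a
    sketch of the idea, not the exact signatures):

      /* before: callers had to fill a struct nvm_rq just for two fields */
      int nvm_set_bb_tbl(struct nvm_dev *dev, struct nvm_rq *rqd, int type);

      /* after: the ppa list and its length are passed explicitly */
      int nvm_set_bb_tbl(struct nvm_dev *dev, struct ppa_addr *ppas,
                         int nr_ppas, int type);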
     
  • The device ops->get_bb_tbl() takes a callback, which allows the caller
    to use its own callback function to update its data structures when the
    table is returned.

    This makes it difficult to pass parameters to the callback, and it is
    usually circumvented with small private structures that carry both the
    caller's state and any flags needed to fulfill the update.

    Refactor ops->get_bb_tbl() to fill a data buffer with the status of the
    blocks returned, and let the user call the callback function manually.
    That will provide the necessary flags and data structures and simplify
    the logic around ops->get_bb_tbl().

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The get block table command returns a list of blocks and planes
    with their associated state. Users, such as gennvm and sysblk,
    manage all planes as a single virtual block.

    It was therefore natural to fold the bad block list before it is
    returned. However, to also allow users that manage blocks on a
    per-plane level to use the interface, the get_bb_tbl interface is
    changed to not fold by default and instead let the caller fold if
    necessary.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     

04 May, 2016

1 commit

  • This fixes a scenario where the device is present and being reset, but a
    request to unbind the driver occurs.

    A previous patch series addressing a device failure removal scenario
    flushed reset_work after controller disable to unblock reset_work waiting
    on a completion that wouldn't occur. This isn't safe as-is. The broken
    scenario can potentially be induced with:

    modprobe nvme && modprobe -r nvme

    To fix, the reset work is flushed immediately after setting the controller
    removing flag, and any subsequent reset will not proceed with controller
    initialization if the flag is set.

    The controller status must be polled while active, so the watchdog timer
    is also left active until the controller is disabled to cleanup requests
    that may be stuck during namespace removal.

    [Fixes: ff23a2a15a2117245b4599c1352343c8b8fb4c43]
    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Keith Busch
     

02 May, 2016

10 commits

  • On receipt of a namespace attribute changed AER, we acquire the
    namespace mutex lock before proceeding to scan and validate the
    namespace list. In case of namespace detach/delete command,
    nvme_ns_remove function deadlocks trying to acquire the already held
    lock.

    All callers, except nvme_remove_namespaces(), of nvme_ns_remove()
    already held namespaces_mutex. So we can simply fix the deadlock by
    not acquiring the mutex in nvme_ns_remove() and acquiring it in
    nvme_remove_namespaces().

    Reported-by: Sunad Bhandary S
    Signed-off-by: Ming Lin
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Ming Lin
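
    Conceptually the locking moves up one level, along these lines (function
    bodies trimmed; a sketch of the idea rather than the literal diff):

      static void nvme_ns_remove(struct nvme_ns *ns)
      {
              lockdep_assert_held(&ns->ctrl->namespaces_mutex);
              /* ... tear down the gendisk and request queue ... */
              list_del_init(&ns->list);
      }

      void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
      {
              struct nvme_ns *ns, *next;

              mutex_lock(&ctrl->namespaces_mutex);
              list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
                      nvme_ns_remove(ns);
              mutex_unlock(&ctrl->namespaces_mutex);
      }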
     
  • Switch to RCU freeing the namespace structure so that
    nvme_start_queues, nvme_stop_queues and nvme_kill_queues can
    get away with only an RCU read side critical section.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Ming Lin
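
    With the namespace list under RCU, the queue helpers only need a read-side
    critical section, roughly as follows (a sketch assuming ->namespaces is an
    RCU-managed list; the real helpers do a bit more per queue):

      void nvme_stop_queues(struct nvme_ctrl *ctrl)
      {
              struct nvme_ns *ns;

              rcu_read_lock();
              list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
                      blk_mq_stop_hw_queues(ns->queue);
              rcu_read_unlock();
      }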
     
  • This moves command cleanup into nvme.h so that the fabrics drivers
    can also use it.

    Signed-off-by: Ming Lin
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Ming Lin
     
  • The transport driver still needs to do the actual submission, but all the
    higher level code can be shared.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Move the scan work item and surrounding code to the common code. For now
    we need a new finish_scan method to allow the PCI driver to set the
    irq affinity hints, but I have plans in the works to obsolete this as well.

    Note that this moves the namespace scanning from nvme_wq to the system
    workqueue, but as we don't rely on namespace scanning to finish from reset
    or I/O this should be fine.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We should only be scanning namespaces if the controller is live. Currently
    we call the function just before setting it live, so fix the code up to
    move the call to nvme_queue_scan to just below the state change.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Acked-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Replace the adhoc flags in the PCI driver with a state machine in the
    core code. Based on code from Sagi Grimberg for the Fabrics driver.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Acked-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Christoph Hellwig
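
    The controller states themselves look roughly like the enum below, with a
    helper such as nvme_change_ctrl_state() validating each transition (the
    names here are assumptions based on the description above):

      enum nvme_ctrl_state {
              NVME_CTRL_NEW,
              NVME_CTRL_LIVE,
              NVME_CTRL_RESETTING,
              NVME_CTRL_DELETING,
              NVME_CTRL_DEAD,
      };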
     
  • It's unused since "NVMe: Move error handling to failed reset handler".

    Signed-off-by: Christoph Hellwig
    Acked-by: Jon Derrick
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • nvme_core_init does:
    1) register_blkdev
    2) __register_chrdev
    3) class_create

    nvme_core_exit should do cleanup in the reverse order.

    Signed-off-by: Wang Sheng-Hui
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Wang Sheng-Hui
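
    In other words, teardown mirrors the setup order in reverse; a sketch
    (the major numbers, minor count and class variable names are assumed):

      void nvme_core_exit(void)
      {
              class_destroy(nvme_class);                   /* 3) class_create      */
              __unregister_chrdev(nvme_char_major, 0,
                                  NVME_MINORS, "nvme");    /* 2) __register_chrdev */
              unregister_blkdev(nvme_major, "nvme");       /* 1) register_blkdev   */
      }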
     
  • If the controller fails and is degraded after a reset, we need to kill
    off all request queues before removing the inaccessible namespaces. This
    will prevent del_gendisk from syncing dirty data, which we can't do
    from a WQ_MEM_RECLAIM work queue.

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

26 Apr, 2016

2 commits


15 Apr, 2016

1 commit

  • Multiple users have reported device initialization failure due to the
    driver not receiving legacy PCI interrupts. This is not unique to any
    particular controller, but has been observed on multiple platforms.

    There have been no issues reported or observed with message signaled
    interrupts, so this patch attempts to use MSI-X during initialization,
    falling back to MSI. If that also fails, legacy interrupts remain the
    default.

    The setup_io_queues error handling had to change as a result: the admin
    queue's msix_entry used to be initialized to the legacy IRQ. The case
    where nr_io_queues is 0 would fail request_irq when setting up the admin
    queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving
    the admin queue's msix_entry invalid. Instead, return success immediately.

    Reported-by: Tim Muhlemmer
    Reported-by: Jon Derrick
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
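
    The fallback order during initialization is roughly the following (a
    sketch using the pre-4.8 PCI IRQ API; the real code also sizes the vector
    count for the I/O queues):

      /* Prefer MSI-X; fall back to MSI, and only then to legacy INTx. */
      if (pci_enable_msix_range(pdev, dev->entry, 1, 1) < 0) {
              if (pci_enable_msi(pdev) < 0)
                      dev_warn(dev->ctrl.device,
                               "MSI-X/MSI unavailable, using legacy INTx\n");
      }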
     

13 Apr, 2016

9 commits

  • This patch adds a check to the nvme_watchdog_timer() function to avoid
    calling reset_work() when an error recovery process is ongoing on the
    controller. The check is made by looking at the pci_channel_offline()
    result.

    If we don't check for this in nvme_watchdog_timer(), the error recovery
    mechanism can't recover well, because reset_work() won't be able to
    do its job (since we're in the middle of an error) and so the
    controller is removed from the system before the error recovery
    mechanism can perform a slot reset (which would allow the adapter to
    recover).

    This patch also splits the large condition expression in
    nvme_watchdog_timer() by introducing an auxiliary function to help
    make the code more readable.

    Reviewed-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Guilherme G. Piccoli
    Signed-off-by: Jens Axboe

    Guilherme G. Piccoli
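
    The guard in the watchdog path amounts to something like the helper below
    (the nvme_should_reset() name is an assumption, and the real condition
    checks additional status bits):

      static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
      {
              /* skip reset_work while EEH/AER error recovery owns the device */
              if (pci_channel_offline(to_pci_dev(dev->dev)))
                      return false;

              /* reset if the controller reports a fatal status */
              return csts & NVME_CSTS_CFS;
      }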
     
  • Depending on options, we might not be using dev in nvme_cancel_io():

    drivers/nvme/host/pci.c: In function ‘nvme_cancel_io’:
    drivers/nvme/host/pci.c:970:19: warning: unused variable ‘dev’ [-Wunused-variable]
    struct nvme_dev *dev = data;
    ^

    So get rid of it, and just cast for the dev_dbg_ratelimited() call.

    Fixes: 82b4552b91c4 ("nvme: Use blk-mq helper for IO termination")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • blk-mq offers a tagset iterator, so let's use that
    instead of nvme_clear_queues.

    Note that nvme_queue_cancel_ios is renamed to nvme_cancel_io,
    as there is no longer a concept of a queue in this function (we
    also lost the print).

    Signed-off-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Acked-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Sagi Grimberg
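
    Usage then boils down to iterating the busy tags on both tag sets with a
    callback that has the busy_tag_iter_fn signature (the call sites below are
    a sketch):

      static void nvme_cancel_io(struct request *req, void *data, bool reserved)
      {
              /* mark the request as cancelled and complete it */
      }

      /* e.g. in the controller shutdown path, after quiescing the queues: */
      blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_io, dev);
      blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_io, dev);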
     
  • If the controller is degraded, the driver should stay out of the way so
    the user can recover the drive. This patch skips driver initiated async
    event requests when the drive is in this state.

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • This moves the nvme_setup_{flush,discard,rw} calls into a common
    nvme_setup_cmd() helper, so we can eventually hide all the command
    setup in the core module and won't even need to update the fabrics
    drivers for any specific command type.

    Signed-off-by: Ming Lin
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lin
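
    The helper is essentially a dispatcher on the request type, roughly as
    follows (a sketch against the 4.7-era block flags; return handling
    simplified):

      int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
                         struct nvme_command *cmd)
      {
              int ret = 0;

              if (req->cmd_type == REQ_TYPE_DRV_PRIV)
                      memcpy(cmd, req->cmd, sizeof(*cmd));
              else if (req->cmd_flags & REQ_FLUSH)
                      nvme_setup_flush(ns, cmd);
              else if (req->cmd_flags & REQ_DISCARD)
                      ret = nvme_setup_discard(ns, req, cmd);
              else
                      nvme_setup_rw(ns, req, cmd);

              return ret;
      }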
     
  • This rewrites nvme_setup_discard() with blk_add_request_payload().
    It allocates only the necessary amount (16 bytes) for the payload.

    Signed-off-by: Ming Lin
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lin
     
  • The helper returns the number of bytes that need to be mapped
    using PRPs/SGL entries.

    Signed-off-by: Ming Lin
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lin
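
    A sketch of what such a helper looks like: a discard maps only its
    16-byte range descriptor rather than the request's data bytes (the
    helper name is assumed):

      static inline unsigned int nvme_map_len(struct request *rq)
      {
              if (rq->cmd_flags & REQ_DISCARD)
                      return sizeof(struct nvme_dsm_range);   /* 16 bytes */
              return blk_rq_bytes(rq);
      }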
     
  • When unloading the driver, nvme_disable_io_queues() calls
    nvme_delete_queue(), which sends the nvme_admin_delete_cq command to the
    admin submission queue. So when the command completes, the lock acquired
    by nvme_irq() actually belongs to the admin queue, while the lock that
    nvme_del_cq_end() is trying to acquire belongs to an I/O queue. So it
    will not deadlock.

    This patch adds lock nesting notation to fix the following report.

    [ 109.840952] =============================================
    [ 109.846379] [ INFO: possible recursive locking detected ]
    [ 109.851806] 4.5.0+ #180 Tainted: G E
    [ 109.856533] ---------------------------------------------
    [ 109.861958] swapper/0/0 is trying to acquire lock:
    [ 109.866771] (&(&nvmeq->q_lock)->rlock){-.....}, at: [] nvme_del_cq_end+0x26/0x70 [nvme]
    [ 109.876535]
    [ 109.876535] but task is already holding lock:
    [ 109.882398] (&(&nvmeq->q_lock)->rlock){-.....}, at: [] nvme_irq+0x1b/0x50 [nvme]
    [ 109.891547]
    [ 109.891547] other info that might help us debug this:
    [ 109.898107] Possible unsafe locking scenario:
    [ 109.898107]
    [ 109.904056] CPU0
    [ 109.906515] ----
    [ 109.908974] lock(&(&nvmeq->q_lock)->rlock);
    [ 109.913381] lock(&(&nvmeq->q_lock)->rlock);
    [ 109.917787]
    [ 109.917787] *** DEADLOCK ***
    [ 109.917787]
    [ 109.923738] May be due to missing lock nesting notation
    [ 109.923738]
    [ 109.930558] 1 lock held by swapper/0/0:
    [ 109.934413] #0: (&(&nvmeq->q_lock)->rlock){-.....}, at: [] nvme_irq+0x1b/0x50 [nvme]
    [ 109.944010]
    [ 109.944010] stack backtrace:
    [ 109.948389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 4.5.0+ #180
    [ 109.955734] Hardware name: Dell Inc. OptiPlex 7010/0YXT71, BIOS A15 08/12/2013
    [ 109.962989] 0000000000000000 ffff88011e203c38 ffffffff81383d9c ffffffff81c13540
    [ 109.970478] ffffffff826711d0 ffff88011e203ce8 ffffffff810bb429 0000000000000046
    [ 109.977964] 0000000000000046 0000000000000000 0000000000b2e597 ffffffff81f4cb00
    [ 109.985453] Call Trace:
    [ 109.987911] [] dump_stack+0x85/0xc9
    [ 109.993711] [] __lock_acquire+0x19b9/0x1c60
    [ 109.999575] [] ? trace_hardirqs_off+0xd/0x10
    [ 110.005524] [] ? complete+0x3d/0x50
    [ 110.010688] [] lock_acquire+0x90/0xf0
    [ 110.016029] [] ? nvme_del_cq_end+0x26/0x70 [nvme]
    [ 110.022418] [] _raw_spin_lock_irqsave+0x4b/0x60
    [ 110.028632] [] ? nvme_del_cq_end+0x26/0x70 [nvme]
    [ 110.035019] [] nvme_del_cq_end+0x26/0x70 [nvme]
    [ 110.041232] [] blk_mq_end_request+0x35/0x60
    [ 110.047095] [] nvme_complete_rq+0x68/0x190 [nvme]
    [ 110.053481] [] __blk_mq_complete_request+0x8f/0x130
    [ 110.060043] [] blk_mq_complete_request+0x31/0x40
    [ 110.066343] [] __nvme_process_cq+0x83/0x240 [nvme]
    [ 110.072818] [] nvme_irq+0x25/0x50 [nvme]
    [ 110.078419] [] handle_irq_event_percpu+0x36/0x110
    [ 110.084804] [] handle_irq_event+0x37/0x60
    [ 110.090491] [] handle_edge_irq+0x93/0x150
    [ 110.096180] [] handle_irq+0xa6/0x130
    [ 110.101431] [] do_IRQ+0x5e/0x120
    [ 110.106333] [] common_interrupt+0x8c/0x8c

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Ming Lin
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Ming Lin
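
    The notation fix itself is essentially a one-liner: take the I/O queue's
    lock with a nested subclass so lockdep can tell the two q_lock instances
    apart (a sketch of the idea):

      /* in nvme_del_cq_end(), locking the I/O queue from admin IRQ context */
      unsigned long flags;

      spin_lock_irqsave_nested(&nvmeq->q_lock, flags, SINGLE_DEPTH_NESTING);
      /* ... requeue or fail the queue's outstanding requests ... */
      spin_unlock_irqrestore(&nvmeq->q_lock, flags);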