21 Mar, 2020

1 commit

  • When we send PDU data, we want to optimize the tcp stack
    operation if we have more data to send. So we set MSG_MORE
    when:
    - We have more fragments coming in the batch, or
    - We have more data to send in this PDU
    - We don't have a data digest trailer
    - We optimize with the SUCCESS flag and omit the NVMe completion
    (used if sq_head pointer update is disabled)

    This addresses a regression in QD=1 with SUCCESS flag optimization
    as we unconditionally set MSG_MORE when we didn't actually have
    more data to send.

    Fixes: 70583295388a ("nvmet-tcp: implement C2HData SUCCESS optimization")
    Reported-by: Mark Wunderlich
    Tested-by: Mark Wunderlich
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Keith Busch

    Sagi Grimberg
     

11 Mar, 2020

1 commit

  • The timeout of the identify cmd, which is invoked as part of admin queue
    creation, can result in freeing of async event data both in the
    nvme_rdma_timeout handler and in the error handling path of
    nvme_rdma_configure_admin_queue, thus causing a NULL pointer dereference.
    Call Trace:
    ? nvme_rdma_setup_ctrl+0x223/0x800 [nvme_rdma]
    nvme_rdma_create_ctrl+0x2ba/0x3f7 [nvme_rdma]
    nvmf_dev_write+0xa54/0xcc6 [nvme_fabrics]
    __vfs_write+0x1b/0x40
    vfs_write+0xb2/0x1b0
    ksys_write+0x61/0xd0
    __x64_sys_write+0x1a/0x20
    do_syscall_64+0x60/0x1e0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reviewed-by: Roland Dreier
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Prabhath Sajeepa
    Signed-off-by: Keith Busch

    Prabhath Sajeepa
     

28 Feb, 2020

1 commit

  • Completions need to be consumed in the same order the controller submitted
    them, otherwise future completion entries may overwrite ones we haven't
    handled yet. Hold the nvme queue's poll lock while completing new CQEs to
    prevent another thread from freeing command tags for reuse out-of-order.

    Fixes: dabcefab45d3 ("nvme: provide optimized poll function for separate poll queues")
    Signed-off-by: Bijan Mottahedeh
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Jens Axboe
    Signed-off-by: Keith Busch

    Bijan Mottahedeh
     

23 Feb, 2020

1 commit


21 Feb, 2020

1 commit

  • kmemleak reports a memory leak with the ana_log_buf allocated by
    nvme_mpath_init():

    unreferenced object 0xffff888120e94000 (size 8208):
    comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
    01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmalloc_order+0x97/0xc0
    [] kmalloc_order_trace+0x24/0x100
    [] __kmalloc+0x24c/0x2d0
    [] nvme_mpath_init+0x23c/0x2b0
    [] nvme_init_identify+0x75f/0x1600
    [] nvme_loop_configure_admin_queue+0x26d/0x280
    [] nvme_loop_create_ctrl+0x2a7/0x710
    [] nvmf_dev_write+0xc66/0x10b9
    [] __vfs_write+0x50/0xa0
    [] vfs_write+0xf3/0x280
    [] ksys_write+0xc6/0x160
    [] __x64_sys_write+0x43/0x50
    [] do_syscall_64+0x77/0x2f0
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    nvme_mpath_init() is called by nvme_init_identify() which is called in
    multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
    means nvme_mpath_init() may be called multiple times before
    nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).

    When nvme_mpath_init() is called multiple times, it overwrites the
    ana_log_buf pointer with a new allocation, thus leaking the previous
    allocation.

    To fix this, free ana_log_buf before allocating a new one.

    Fixes: 0d0b660f214dc490 ("nvme: add ANA support")
    Cc:
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Keith Busch

    Logan Gunthorpe
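
    The free-before-reallocate idea in the message above can be sketched in
    userspace C. This is only an illustration: struct toy_ctrl and the toy_*
    functions are hypothetical stand-ins, not the driver's code, and malloc()
    and free() stand in for the kernel allocator.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the controller state; only the buffer matters. */
struct toy_ctrl {
	void *ana_log_buf;
	size_t ana_log_size;
};

/*
 * May run several times before toy_mpath_uninit(), like the init path
 * described above. Freeing the old buffer first keeps repeated calls from
 * leaking it; free(NULL) is a no-op, so the first call is safe too.
 */
int toy_mpath_init(struct toy_ctrl *ctrl, size_t size)
{
	free(ctrl->ana_log_buf);
	ctrl->ana_log_buf = malloc(size);
	if (!ctrl->ana_log_buf) {
		ctrl->ana_log_size = 0;
		return -1;
	}
	ctrl->ana_log_size = size;
	return 0;
}

/* Releases the buffer once, at teardown. */
void toy_mpath_uninit(struct toy_ctrl *ctrl)
{
	free(ctrl->ana_log_buf);
	ctrl->ana_log_buf = NULL;
	ctrl->ana_log_size = 0;
}
```

    Calling toy_mpath_init() twice on a zero-initialized struct toy_ctrl and
    then toy_mpath_uninit() once should leave nothing allocated.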
     

20 Feb, 2020

1 commit

  • gcc may detect a false positive on nvme using an uninitialized variable
    if setting features fails. Since this is not a fast path, explicitly
    initialize this variable to suppress the warning.

    Reported-by: Arnd Bergmann
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Keith Busch
     

19 Feb, 2020

2 commits

  • People reported that old Apple machines do not work properly
    if a non-first IRQ vector is in use.

    Set a quirk for those models to limit IRQ use to the first vector only.

    Based on original patch by GitHub user npx001.

    Link: https://github.com/Dunedan/mbp-2016-linux/issues/9
    Cc: Benjamin Herrenschmidt
    Cc: Leif Liddy
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Keith Busch

    Andy Shevchenko
     
  • The Samsung SSD SM981/PM981 and Toshiba SSD KBG40ZNT256G on the Lenovo
    C640 platform experience runtime resume issues when the SSDs are kept in
    sleep/suspend mode for a long time.

    This patch applies the 'Simple Suspend' quirk to these configurations.
    With this patch, the issue was not observed in a 1+ day test.

    Reviewed-by: Jon Derrick
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Shyjumon N
    Signed-off-by: Keith Busch

    Shyjumon N
     

17 Feb, 2020

1 commit

  • Pull block fixes from Jens Axboe:
    "Not a lot here, which is great, basically just three small bcache
    fixes from Coly, and four NVMe fixes via Keith"

    * tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block:
    nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info
    nvme/pci: move cqe check after device shutdown
    nvme: prevent warning triggered by nvme_stop_keep_alive
    nvme/tcp: fix bug on double requeue when send fails
    bcache: remove macro nr_to_fifo_front()
    bcache: Revert "bcache: shrink btree node cache after bch_btree_check()"
    bcache: ignore pending signals when creating gc and allocator thread

    Linus Torvalds
     

15 Feb, 2020

4 commits

  • The nvme fw-activate operation produces the warning log below;
    fix it by updating the parameter order.

    [ 113.231513] nvme nvme0: Get FW SLOT INFO log error

    Fixes: 0e98719b0e4b ("nvme: simplify the API for getting log pages")
    Reported-by: Sujith Pandel
    Reviewed-by: David Milburn
    Signed-off-by: Yi Zhang
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Yi Zhang
     
  • Many users have reported nvme triggered irq_startup() warnings during
    shutdown. The driver uses the nvme queue's irq to synchronize scanning
    for completions, and enabling an interrupt affined to only offline CPUs
    triggers the alarming warning.

    Move the final CQE check to after disabling the device and all
    registered interrupts have been torn down so that we do not have any
    IRQ to synchronize.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206509
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Delayed keep alive work is queued on system workqueue and may be cancelled
    via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq.

    check_flush_dependency() detects mismatched attributes between the work-queue
    context used to cancel the keep alive work and system-wq. Specifically
    system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used
    to cancel keep alive work have WQ_MEM_RECLAIM flag.

    Example warning:

    workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
    is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]

    To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq.

    However this creates a secondary concern where work and a request to cancel
    that work may be in the same work queue - namely err_work in the rdma and
    tcp transports, which will want to flush/cancel the keep alive work which
    will now be on nvme_wq.

    After reviewing the transports, it looks like err_work can be moved to
    nvme_reset_wq. In fact that aligns them better with transition into
    RESETTING and performing related reset work in nvme_reset_wq.

    Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq.

    Signed-off-by: Nigel Kirkland
    Signed-off-by: James Smart
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Nigel Kirkland
     
  • When nvme_tcp_io_work() fails to send to socket due to
    connection close/reset, error_recovery work is triggered
    from nvme_tcp_state_change() socket callback.
    This cancels all the active requests in the tagset,
    which requeues them.

    The failed request, however, was ended and thus requeued
    individually as well unless send returned -EPIPE.
    Another return code to be treated the same way is -ECONNRESET.

    Double requeue caused BUG_ON(blk_queued_rq(rq))
    in blk_mq_requeue_request() from either the individual requeue
    of the failed request or the bulk requeue from
    blk_mq_tagset_busy_iter(, nvme_cancel_request, );

    Signed-off-by: Anton Eidelman
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Anton Eidelman
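
    The rule described above (do not fail the request individually when the
    send error already implies a full error recovery) can be sketched as a
    small predicate. The function name toy_should_fail_request() is
    hypothetical and this is not the driver's code; only the two errno values
    come from the message.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Returns true when the caller should fail (and thus requeue) the request
 * individually after a send error. -EPIPE and -ECONNRESET mean the
 * connection is gone, so error recovery will cancel every request in the
 * tagset; failing this one here as well would requeue it twice.
 */
bool toy_should_fail_request(int send_result)
{
	if (send_result >= 0)
		return false;	/* not an error */
	return send_result != -EPIPE && send_result != -ECONNRESET;
}
```

    Any other negative result, such as -EIO, still fails the single request,
    since no bulk cancel is expected to follow it.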
     

06 Feb, 2020

1 commit

  • Pull more block updates from Jens Axboe:
    "Some later arrivals, but all fixes at this point:

    - bcache fix series (Coly)

    - Series of BFQ fixes (Paolo)

    - NVMe pull request from Keith with a few minor NVMe fixes

    - Various little tweaks"

    * tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits)
    nvmet: update AEN list and array at one place
    nvmet: Fix controller use after free
    nvmet: Fix error print message at nvmet_install_queue function
    brd: check and limit max_part par
    nvme-pci: remove nvmeq->tags
    nvmet: fix dsm failure when payload does not match sgl descriptor
    nvmet: Pass lockdep expression to RCU lists
    block, bfq: clarify the goal of bfq_split_bfqq()
    block, bfq: get a ref to a group when adding it to a service tree
    block, bfq: remove ifdefs from around gets/puts of bfq groups
    block, bfq: extend incomplete name of field on_st
    block, bfq: get extra ref to prevent a queue from being freed during a group move
    block, bfq: do not insert oom queue into position tree
    block, bfq: do not plug I/O for bfq_queues with no proc refs
    bcache: check return value of prio_read()
    bcache: fix incorrect data type usage in btree_flush_write()
    bcache: add readahead cache policy options via sysfs interface
    bcache: explicity type cast in bset_bkey_last()
    bcache: fix memory corruption in bch_cache_accounting_clear()
    xen/blkfront: limit allocated memory size to actual use case
    ...

    Linus Torvalds
     

05 Feb, 2020

3 commits

  • All async events are enqueued via nvmet_add_async_event() which
    updates the ctrl->async_event_cmds[] array and additionally a struct
    nvmet_async_event is added to the ctrl->async_events list.

    Under normal operations the nvmet_async_event_work() updates again
    the ctrl->async_event_cmds and removes the corresponding struct
    nvmet_async_event from the list again. Though nvmet_sq_destroy() could
    be called which calls nvmet_async_events_free() which only updates the
    ctrl->async_event_cmds[] array.

    Add new functions nvmet_async_events_process() and
    nvmet_async_events_free() to process async events, update an array and
    the list.

    When we destroy the submission queue, after clearing the AENs present on
    the ctrl->async_events list we also loop over ctrl->async_event_cmds[] for
    any requests posted by the host for which we don't have the AEN in
    the ctrl->async_events list by calling nvmet_async_events_process()
    and nvmet_async_events_free().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Daniel Wagner
    [chaitanya.kulkarni@wdc.com
    * Loop over and clear out outstanding requests
    * Update changelog
    ]
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Keith Busch

    Daniel Wagner
     
  • After nvmet_install_queue() sets sq->ctrl, calling nvmet_sq_destroy()
    reduces the controller refcount. If nvmet_install_queue() fails,
    nvmet_ctrl_put() is called twice (at nvmet_sq_destroy and
    nvmet_execute_io_connect/nvmet_execute_admin_connect) instead of once for
    the queue, which leads to a use-after-free of the controller. Fix this by
    setting sq->ctrl to NULL if nvmet_install_queue() fails.

    The bug leads to the following Call Trace:

    [65857.994862] refcount_t: underflow; use-after-free.
    [65858.108304] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
    [65858.115557] RIP: 0010:refcount_warn_saturate+0xe5/0xf0
    [65858.208141] Call Trace:
    [65858.211203] nvmet_sq_destroy+0xe1/0xf0 [nvmet]
    [65858.216383] nvmet_rdma_release_queue_work+0x37/0xf0 [nvmet_rdma]
    [65858.223117] process_one_work+0x167/0x370
    [65858.227776] worker_thread+0x49/0x3e0
    [65858.232089] kthread+0xf5/0x130
    [65858.235895] ? max_active_store+0x80/0x80
    [65858.240504] ? kthread_bind+0x10/0x10
    [65858.244832] ret_from_fork+0x1f/0x30
    [65858.249074] ---[ end trace f82d59250b54beb7 ]---

    Fixes: bb1cc74790eb ("nvmet: implement valid sqhd values in completions")
    Fixes: 1672ddb8d691 ("nvmet: Add install_queue callout")
    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Israel Rukshin
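
    A toy model of the double put described above, assuming a plain integer
    refcount. All toy_* names are hypothetical and the control flow is
    simplified; it only shows why clearing sq->ctrl on failure leaves exactly
    one put for the queue.

```c
#include <stddef.h>

struct toy_ctrl { int refcount; };
struct toy_sq { struct toy_ctrl *ctrl; };

static void toy_ctrl_put(struct toy_ctrl *ctrl)
{
	ctrl->refcount--;
}

/* Drops the queue's controller reference, if the queue still holds one. */
void toy_sq_destroy(struct toy_sq *sq)
{
	if (sq->ctrl) {
		toy_ctrl_put(sq->ctrl);
		sq->ctrl = NULL;
	}
}

/*
 * Binds the queue to the controller. On failure the pointer is cleared so
 * that toy_sq_destroy() does not drop the reference a second time; the
 * connect path below owns the single put.
 */
int toy_install_queue(struct toy_ctrl *ctrl, struct toy_sq *sq, int fail)
{
	sq->ctrl = ctrl;
	if (fail) {
		sq->ctrl = NULL;
		return -1;
	}
	return 0;
}

int toy_connect(struct toy_ctrl *ctrl, struct toy_sq *sq, int fail)
{
	ctrl->refcount++;		/* reference taken for this queue */
	if (toy_install_queue(ctrl, sq, fail)) {
		toy_ctrl_put(ctrl);	/* the only put on the failure path */
		return -1;
	}
	return 0;
}
```

    Without the sq->ctrl = NULL line in the failure branch, a later
    toy_sq_destroy() would decrement the count a second time.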
     
  • Place the arguments in the correct order.

    Fixes: 1672ddb8d691 ("nvmet: Add install_queue callout")
    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Israel Rukshin
     

04 Feb, 2020

3 commits

  • There is no real need to have a pointer to the tagset in
    struct nvme_queue, as we only need it in a single place, and that place
    can derive the used tagset from the device and qid trivially. This
    fixes a problem with stale pointer exposure when tagsets are reset,
    and also shrinks the nvme_queue structure. It also matches what most
    other transports have done since day 1.

    Reported-by: Edmund Nadolski
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Christoph Hellwig
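
    One way to picture "derive the used tagset from the device and qid" is
    the sketch below. The toy_* types are hypothetical, and the layout (qid 0
    uses an admin tag set, I/O queue n uses slot n - 1) is an assumption made
    for the example rather than a statement about the driver's structures.

```c
#define TOY_MAX_IO_QUEUES 4

struct toy_tags { int id; };

struct toy_dev {
	struct toy_tags admin_tags;			/* used by qid 0 */
	struct toy_tags io_tags[TOY_MAX_IO_QUEUES];	/* qid n uses slot n - 1 */
};

struct toy_queue {
	struct toy_dev *dev;
	unsigned int qid;
	/* no cached tags pointer: nothing to go stale when tag sets are reset */
};

/* Looks the tag set up on demand instead of caching a pointer per queue. */
struct toy_tags *toy_queue_tagset(struct toy_queue *q)
{
	if (q->qid == 0)
		return &q->dev->admin_tags;
	return &q->dev->io_tags[q->qid - 1];
}
```

    The lookup costs one comparison and one index, which is why it can
    replace a stored pointer in the single place that needs it.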
     
  • The host is allowed to pass the controller an sgl describing a buffer
    that is larger than the dsm payload itself; allow it when executing
    dsm.

    Reported-by: Dakshaja Uppalapati
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Keith Busch

    Sagi Grimberg
     
  • ctrl->subsys->namespaces and subsys->namespaces are traversed with
    list_for_each_entry_rcu outside an RCU read-side critical section but
    under the protection of ctrl->subsys->lock and subsys->lock respectively.

    Hence, add the corresponding lockdep expression to the list traversal
    primitive to silence false-positive lockdep warnings, and harden RCU
    lists.

    Reported-by: kbuild test robot
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Amol Grover
    Signed-off-by: Keith Busch

    Amol Grover
     

01 Feb, 2020

1 commit

  • This switches the nvme driver to use kelvin_to_millicelsius() and
    millicelsius_to_kelvin() in <linux/units.h>.

    Link: http://lkml.kernel.org/r/1576386975-7941-8-git-send-email-akinobu.mita@gmail.com
    Signed-off-by: Akinobu Mita
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Guenter Roeck
    Reviewed-by: Keith Busch
    Reviewed-by: Andy Shevchenko
    Cc: Sujith Thomas
    Cc: Darren Hart
    Cc: Andy Shevchenko
    Cc: Zhang Rui
    Cc: Daniel Lezcano
    Cc: Amit Kucheria
    Cc: Jean Delvare
    Cc: Guenter Roeck
    Cc: Keith Busch
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Emmanuel Grumbach
    Cc: Hartmut Knaack
    Cc: Johannes Berg
    Cc: Jonathan Cameron
    Cc: Jonathan Cameron
    Cc: Kalle Valo
    Cc: Lars-Peter Clausen
    Cc: Luca Coelho
    Cc: Peter Meerwald-Stadler
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
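
    For readers unfamiliar with the conversion, here is a userspace
    re-creation of the arithmetic. The toy_* names, the constants, and the
    round-to-nearest behaviour are assumptions for illustration; the kernel
    helpers named above may differ in detail.

```c
/* Userspace re-creation; constants and rounding here are assumptions. */
#define TOY_MILLIDEGREE_PER_DEGREE	1000L
#define TOY_ABSOLUTE_ZERO_MILLICELSIUS	(-273150L)

/* Whole kelvin in, millidegrees Celsius out. */
long toy_kelvin_to_millicelsius(long kelvin)
{
	return kelvin * TOY_MILLIDEGREE_PER_DEGREE + TOY_ABSOLUTE_ZERO_MILLICELSIUS;
}

/* Millidegrees Celsius in, nearest whole kelvin out. */
long toy_millicelsius_to_kelvin(long mc)
{
	long mk = mc - TOY_ABSOLUTE_ZERO_MILLICELSIUS;	/* millikelvin */

	/* Round half away from zero; mk is non-negative for physical inputs. */
	return (mk + (mk >= 0 ? 500 : -500)) / TOY_MILLIDEGREE_PER_DEGREE;
}
```

    For instance, 300 K maps to 300 * 1000 - 273150 = 26850 millidegrees
    Celsius, and converting 26850 back yields 300 K.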
     

28 Jan, 2020

1 commit

  • Pull core block updates from Jens Axboe:
    "This may be the most quiet round we've had in years. I'm not
    complaining. Really not a lot to detail here, outside of spelling and
    documentation improvements/fixes, we have:

    - Allow t10-pi to be modular (Herbert)

    - Remove dead code in bfq (Alex)

    - Mark zone management requests with REQ_SYNC (Chaitanya)

    - BFQ division improvement (Wen)

    - Small series improving plugging (Pavel)"

    * tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block:
    partitions/ldm: fix spelling mistake "to" -> "too"
    block, bfq: improve arithmetic division in bfq_delta()
    block/bfq: remove unused bfq_class_rt which never used
    block: mark zone-mgmt bios with REQ_SYNC
    blk-mq: Document functions for sending request
    block: Allow t10-pi to be modular
    blk-mq: optimise blk_mq_flush_plug_list()
    list: introduce list_for_each_continue()
    blk-mq: optimise rq sort function

    Linus Torvalds
     

10 Jan, 2020

2 commits

  • The existing implementation for the get_feature admin-cmd does not
    use per-feature data len. This patch introduces a new helper function
    nvmet_feat_data_len(), which is used to calculate per feature data len.
    Right now we only set data len for fid 0x81 (NVME_FEAT_HOST_ID).

    Fixes: e9061c397839 ("nvmet: Remove the data_len field from the nvmet_req struct")

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Amit Engel
    [endianness, naming, and kernel style fixes]
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Amit Engel
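
    A per-feature length helper of the kind described above might look like
    the sketch below. toy_feat_data_len() is a hypothetical name; the 16-byte
    host identifier length and taking the feature ID from the low byte of
    cdw10 are assumptions made for the example.

```c
#include <stdint.h>

#define TOY_FEAT_HOST_ID	0x81	/* feature ID named in the commit message */
#define TOY_HOST_ID_LEN		16	/* assumption: 128-bit host identifier */

/* The feature ID is taken from the low byte of cdw10 (assumed layout). */
uint32_t toy_feat_data_len(uint32_t cdw10)
{
	switch (cdw10 & 0xff) {
	case TOY_FEAT_HOST_ID:
		return TOY_HOST_ID_LEN;
	default:
		return 0;	/* no data payload for other features */
	}
}
```

    Further features would only need another case label, so the expected
    transfer length stays in one place.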
     
  • Decode interrupted command and not ready namespace nvme status codes to
    BLK_STS_TARGET. These are not generic IO errors and should use a non-path
    specific error so that it can use the non-failover retry path.

    Reported-by: John Meneghini
    Cc: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
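
    The decoding described above amounts to a small status translation
    table. In this sketch the enum, the toy_* names, and the numeric status
    values are illustrative assumptions, not copied from the driver.

```c
enum toy_blk_status {
	TOY_BLK_STS_OK,
	TOY_BLK_STS_IOERR,	/* generic I/O error */
	TOY_BLK_STS_TARGET,	/* target-side error, not path specific */
};

/* Illustrative status code values; treat them as assumptions. */
#define TOY_SC_SUCCESS		0x00
#define TOY_SC_CMD_INTERRUPTED	0x21
#define TOY_SC_NS_NOT_READY	0x82

/* Maps a controller status code to a block-layer style status. */
enum toy_blk_status toy_status_to_blk(unsigned int status)
{
	switch (status) {
	case TOY_SC_SUCCESS:
		return TOY_BLK_STS_OK;
	case TOY_SC_CMD_INTERRUPTED:
	case TOY_SC_NS_NOT_READY:
		return TOY_BLK_STS_TARGET;
	default:
		return TOY_BLK_STS_IOERR;
	}
}
```

    Keeping the two codes out of the default branch is what lets a caller
    choose a retry on the same path instead of a failover.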
     

07 Jan, 2020

1 commit

  • Currently t10-pi can only be built into the block layer which via
    crc-t10dif pulls in a whole chunk of the Crypto API. In fact all
    users of t10-pi work as modules and there is no reason for it to
    always be built-in.

    This patch adds a new hidden option for t10-pi that is selected
    automatically based on BLK_DEV_INTEGRITY and whether the users
    of t10-pi are built-in or not.

    Signed-off-by: Herbert Xu
    Signed-off-by: Jens Axboe

    Herbert Xu
     

14 Dec, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - stable fix for the bi_size overflow. Not a corruption issue, but a
    case where we could merge but disallowed (Andreas)

    - NVMe pull request via Keith, with various fixes.

    - MD pull request from Song.

    - Merge window regression fix for the rq passthrough stats (Logan)

    - Remove unused blkcg_drain_queue() function (Guoqing)

    * tag 'for-linus-20191212' of git://git.kernel.dk/linux-block:
    blk-cgroup: remove blkcg_drain_queue
    block: fix NULL pointer dereference in account statistics with IDE
    md: make sure desc_nr less than MD_SB_DISKS
    md: raid1: check rdev before reference in raid1_sync_request func
    raid5: need to set STRIPE_HANDLE for batch head
    block: fix "check bi_size overflow before merge"
    nvme/pci: Fix read queue count
    nvme/pci Limit write queue sizes to possible cpus
    nvme/pci: Fix write and poll queue types
    nvme/pci: Remove last_cq_head
    nvme: Namepace identification descriptor list is optional
    nvme-fc: fix double-free scenarios on hw queues
    nvme: else following return is not needed
    nvme: add error message on mismatching controller ids
    nvme_fc: add module to ops template to allow module references
    nvmet-loop: Avoid preallocating big SGL for data
    nvme-fc: Avoid preallocating big SGL for data
    nvme-rdma: Avoid preallocating big SGL for data

    Linus Torvalds
     

07 Dec, 2019

4 commits

  • Pull NVMe fixes from Keith

    * 'nvme/for-5.5' of git://git.infradead.org/nvme:
    nvme/pci: Fix read queue count
    nvme/pci Limit write queue sizes to possible cpus
    nvme/pci: Fix write and poll queue types
    nvme/pci: Remove last_cq_head
    nvme: Namepace identification descriptor list is optional
    nvme-fc: fix double-free scenarios on hw queues
    nvme: else following return is not needed
    nvme: add error message on mismatching controller ids
    nvme_fc: add module to ops template to allow module references
    nvmet-loop: Avoid preallocating big SGL for data
    nvme-fc: Avoid preallocating big SGL for data
    nvme-rdma: Avoid preallocating big SGL for data

    Jens Axboe
     
  • If nvme.write_queues equals the number of CPUs, the driver had decreased
    the number of interrupts available such that there could only be one read
    queue even if the controller could support more. Remove the interrupt
    count reduction in this case. The driver wouldn't request more IRQs than
    it wants queues anyway.

    Reviewed-by: Jens Axboe
    Signed-off-by: Keith Busch

    Keith Busch
     
  • The driver can never use more queues of any type than the number of
    possible CPUs, so a higher value causes the driver to allocate more
    memory for IO queues than it could ever use. Limit the parameter at
    module load time to the number of possible cpus.

    Reviewed-by: Jens Axboe
    Signed-off-by: Keith Busch

    Keith Busch
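
    The limit described above is a simple clamp. The sketch below uses a
    hypothetical toy_limit_queues() and takes the CPU count as a plain
    argument, which is an assumption; how the driver obtains that count is
    not shown here.

```c
/* Caps a requested queue count at the number of possible CPUs. */
unsigned int toy_limit_queues(unsigned int requested, unsigned int possible_cpus)
{
	return requested > possible_cpus ? possible_cpus : requested;
}
```

    With 8 possible CPUs, a request for 64 queues is reduced to 8, while a
    request for 4 is left alone.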
     
  • The number of poll or write queues should never be negative. Use unsigned
    types so that it's not possible to have the driver not allocate
    any queues.

    Reviewed-by: Jens Axboe
    Signed-off-by: Keith Busch

    Keith Busch
     

04 Dec, 2019

1 commit

  • Pull PCI updates from Bjorn Helgaas:
    "Enumeration:

    - Warn if a host bridge has no NUMA info (Yunsheng Lin)

    - Add PCI_STD_NUM_BARS for the number of standard BARs (Denis
    Efremov)

    Resource management:

    - Fix boot-time Embedded Controller GPE storm caused by incorrect
    resource assignment after ACPI Bus Check Notification (Mika
    Westerberg)

    - Protect pci_reassign_bridge_resources() against concurrent
    addition/removal (Benjamin Herrenschmidt)

    - Fix bridge dma_ranges resource list cleanup (Rob Herring)

    - Add "pci=hpmmiosize" and "pci=hpmmioprefsize" parameters to control
    the MMIO and prefetchable MMIO window sizes of hotplug bridges
    independently (Nicholas Johnson)

    - Fix MMIO/MMIO_PREF window assignment that assigned more space than
    desired (Nicholas Johnson)

    - Only enforce bus numbers from bridge EA if the bridge has EA
    devices downstream (Subbaraya Sundeep)

    - Consolidate DT "dma-ranges" parsing and convert all host drivers to
    use shared parsing (Rob Herring)

    Error reporting:

    - Restore AER capability after resume (Mayurkumar Patel)

    - Add PoisonTLPBlocked AER counter (Rajat Jain)

    - Use for_each_set_bit() to simplify AER code (Andy Shevchenko)

    - Fix AER kernel-doc (Andy Shevchenko)

    - Add "pcie_ports=dpc-native" parameter to allow native use of DPC
    even if platform didn't grant control over AER (Olof Johansson)

    Hotplug:

    - Avoid returning prematurely from sysfs requests to enable or
    disable a PCIe hotplug slot (Lukas Wunner)

    - Don't disable interrupts twice when suspending hotplug ports (Mika
    Westerberg)

    - Fix deadlocks when PCIe ports are hot-removed while suspended (Mika
    Westerberg)

    Power management:

    - Remove unnecessary ASPM locking (Bjorn Helgaas)

    - Add support for disabling L1 PM Substates (Heiner Kallweit)

    - Allow re-enabling Clock PM after it has been disabled (Heiner
    Kallweit)

    - Add sysfs attributes for controlling ASPM link states (Heiner
    Kallweit)

    - Remove CONFIG_PCIEASPM_DEBUG, including "link_state" and "clk_ctl"
    sysfs files (Heiner Kallweit)

    - Avoid AMD FCH XHCI USB PME# from D0 defect that prevents wakeup on
    USB 2.0 or 1.1 connect events (Kai-Heng Feng)

    - Move power state check out of pci_msi_supported() (Bjorn Helgaas)

    - Fix incorrect MSI-X masking on resume and revert related nvme quirk
    for Kingston NVME SSD running FW E8FK11.T (Jian-Hong Pan)

    - Always return devices to D0 when thawing to fix hibernation with
    drivers like mlx4 that used legacy power management (previously we
    only did it for drivers with new power management ops) (Dexuan Cui)

    - Clear PCIe PME Status even for legacy power management (Bjorn
    Helgaas)

    - Fix PCI PM documentation errors (Bjorn Helgaas)

    - Use dev_printk() for more power management messages (Bjorn Helgaas)

    - Apply D2 delay as milliseconds, not microseconds (Bjorn Helgaas)

    - Convert xen-platform from legacy to generic power management (Bjorn
    Helgaas)

    - Removed unused .resume_early() and .suspend_late() legacy power
    management hooks (Bjorn Helgaas)

    - Rearrange power management code for clarity (Rafael J. Wysocki)

    - Decode power states more clearly ("4" or "D4" really refers to
    "D3cold") (Bjorn Helgaas)

    - Notice when reading PM Control register returns an error (~0)
    instead of interpreting it as being in D3hot (Bjorn Helgaas)

    - Add missing link delays required by the PCIe spec (Mika Westerberg)

    Virtualization:

    - Move pci_prg_resp_pasid_required() to CONFIG_PCI_PRI (Bjorn
    Helgaas)

    - Allow VFs to use PRI (the PF PRI is shared by the VFs, but the code
    previously didn't recognize that) (Kuppuswamy Sathyanarayanan)

    - Allow VFs to use PASID (the PF PASID capability is shared by the
    VFs, but the code previously didn't recognize that) (Kuppuswamy
    Sathyanarayanan)

    - Disconnect PF and VF ATS enablement, since ATS in PFs and
    associated VFs can be enabled independently (Kuppuswamy
    Sathyanarayanan)

    - Cache PRI and PASID capability offsets (Kuppuswamy Sathyanarayanan)

    - Cache the PRI PRG Response PASID Required bit (Bjorn Helgaas)

    - Consolidate ATS declarations in linux/pci-ats.h (Krzysztof
    Wilczynski)

    - Remove unused PRI and PASID stubs (Bjorn Helgaas)

    - Removed unnecessary EXPORT_SYMBOL_GPL() from ATS, PRI, and PASID
    interfaces that are only used by built-in IOMMU drivers (Bjorn
    Helgaas)

    - Hide PRI and PASID state restoration functions used only inside the
    PCI core (Bjorn Helgaas)

    - Add a DMA alias quirk for the Intel VCA NTB (Slawomir Pawlowski)

    - Serialize sysfs sriov_numvfs reads vs writes (Pierre Crégut)

    - Update Cavium ACS quirk for ThunderX2 and ThunderX3 (George
    Cherian)

    - Fix the UPDCR register address in the Intel ACS quirk (Steffen
    Liebergeld)

    - Unify ACS quirk implementations (Bjorn Helgaas)

    Amlogic Meson host bridge driver:

    - Fix meson PERST# GPIO polarity problem (Remi Pommarel)

    - Add DT bindings for Amlogic Meson G12A (Neil Armstrong)

    - Fix meson clock names to match DT bindings (Neil Armstrong)

    - Add meson support for Amlogic G12A SoC with separate shared PHY
    (Neil Armstrong)

    - Add meson extended PCIe PHY functions for Amlogic G12A USB3+PCIe
    combo PHY (Neil Armstrong)

    - Add arm64 DT for Amlogic G12A PCIe controller node (Neil Armstrong)

    - Add commented-out description of VIM3 USB3/PCIe mux in arm64 DT
    (Neil Armstrong)

    Broadcom iProc host bridge driver:

    - Invalidate iProc PAXB address mapping before programming it
    (Abhishek Shah)

    - Fix iproc-msi and mvebu __iomem annotations (Ben Dooks)

    Cadence host bridge driver:

    - Refactor Cadence PCIe host controller to use as a library for both
    host and endpoint (Tom Joseph)

    Freescale Layerscape host bridge driver:

    - Add layerscape LS1028a support (Xiaowei Bao)

    Intel VMD host bridge driver:

    - Add VMD bus 224-255 restriction decode (Jon Derrick)

    - Add VMD 8086:9A0B device ID (Jon Derrick)

    - Remove Keith from VMD maintainer list (Keith Busch)

    Marvell ARMADA 3700 / Aardvark host bridge driver:

    - Use LTSSM state to build link training flag since Aardvark doesn't
    implement the Link Training bit (Remi Pommarel)

    - Delay before training Aardvark link in case PERST# was asserted
    before the driver probe (Remi Pommarel)

    - Fix Aardvark issues with Root Control reads and writes (Remi
    Pommarel)

    - Don't rely on jiffies in Aardvark config access path since
    interrupts may be disabled (Remi Pommarel)

    - Fix Aardvark big-endian support (Grzegorz Jaszczyk)

    Marvell ARMADA 370 / XP host bridge driver:

    - Make mvebu_pci_bridge_emul_ops static (Ben Dooks)

    Microsoft Hyper-V host bridge driver:

    - Add hibernation support for Hyper-V virtual PCI devices (Dexuan
    Cui)

    - Track Hyper-V pci_protocol_version per-hbus, not globally (Dexuan
    Cui)

    - Avoid kmemleak false positive on hv hbus buffer (Dexuan Cui)

    Mobiveil host bridge driver:

    - Change mobiveil csr_read()/write() function names that conflict
    with riscv arch functions (Kefeng Wang)

    NVIDIA Tegra host bridge driver:

    - Fix Tegra CLKREQ dependency programming (Vidya Sagar)

    Renesas R-Car host bridge driver:

    - Remove unnecessary header include from rcar (Andrew Murray)

    - Tighten register index checking for rcar inbound range programming
    (Marek Vasut)

    - Fix rcar inbound range alignment calculation to improve packing of
    multiple entries (Marek Vasut)

    - Update rcar MACCTLR setting to match documentation (Yoshihiro
    Shimoda)

    - Clear bit 0 of MACCTLR before PCIETCTLR.CFINIT per manual
    (Yoshihiro Shimoda)

    - Add Marek Vasut and Yoshihiro Shimoda as R-Car maintainers (Simon
    Horman)

    Rockchip host bridge driver:

    - Make rockchip 0V9 and 1V8 power regulators non-optional (Robin
    Murphy)

    Socionext UniPhier host bridge driver:

    - Set uniphier to host (RC) mode always (Kunihiko Hayashi)

    Endpoint drivers:

    - Fix endpoint driver sign extension problem when shifting page
    number to phys_addr_t (Alan Mikhak)

    Misc:

    - Add NumaChip SPDX header (Krzysztof Wilczynski)

    - Replace EXTRA_CFLAGS with ccflags-y (Krzysztof Wilczynski)

    - Remove unused includes (Krzysztof Wilczynski)

    - Removed unused sysfs attribute groups (Ben Dooks)

    - Remove PTM and ASPM dependencies on PCIEPORTBUS (Bjorn Helgaas)

    - Add PCIe Link Control 2 register field definitions to replace magic
    numbers in AMDGPU and Radeon CIK/SI (Bjorn Helgaas)

    - Fix incorrect Link Control 2 Transmit Margin usage in AMDGPU and
    Radeon CIK/SI PCIe Gen3 link training (Bjorn Helgaas)

    - Use pcie_capability_read_word() instead of pci_read_config_word()
    in AMDGPU and Radeon CIK/SI (Frederick Lawler)

    - Remove unused pci_irq_get_node() (Greg Kroah-Hartman)

    - Make asm/msi.h mandatory and simplify PCI_MSI_IRQ_DOMAIN Kconfig
    (Palmer Dabbelt, Michal Simek)

    - Read all 64 bits of Switchtec part_event_bitmap (Logan Gunthorpe)

    - Fix erroneous intel-iommu dependency on CONFIG_AMD_IOMMU (Bjorn
    Helgaas)

    - Fix bridge emulation big-endian support (Grzegorz Jaszczyk)

    - Fix dwc find_next_bit() usage (Niklas Cassel)

    - Fix pcitest.c fd leak (Hewenliang)

    - Fix typos and comments (Bjorn Helgaas)

    - Fix Kconfig whitespace errors (Krzysztof Kozlowski)"

    * tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (160 commits)
    PCI: Remove PCI_MSI_IRQ_DOMAIN architecture whitelist
    asm-generic: Make msi.h a mandatory include/asm header
    Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"
    PCI/MSI: Fix incorrect MSI-X masking on resume
    PCI/MSI: Move power state check out of pci_msi_supported()
    PCI/MSI: Remove unused pci_irq_get_node()
    PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer
    PCI: hv: Change pci_protocol_version to per-hbus
    PCI: hv: Add hibernation support
    PCI: hv: Reorganize the code in preparation of hibernation
    MAINTAINERS: Remove Keith from VMD maintainer
    PCI/ASPM: Remove PCIEASPM_DEBUG Kconfig option and related code
    PCI/ASPM: Add sysfs attributes for controlling ASPM link states
    PCI: Fix indentation
    drm/radeon: Prefer pcie_capability_read_word()
    drm/radeon: Replace numbers with PCI_EXP_LNKCTL2 definitions
    drm/radeon: Correct Transmit Margin masks
    drm/amdgpu: Prefer pcie_capability_read_word()
    PCI: uniphier: Set mode register to host mode
    drm/amdgpu: Replace numbers with PCI_EXP_LNKCTL2 definitions
    ...

    Linus Torvalds
     

03 Dec, 2019

2 commits

  • We had been saving the last_cq_head seen from an interrupt so that a
    polled queue wouldn't mistakenly trigger spurious interrupt detection. We
    don't poll interrupt driven queues any more, so saving this value is
    pointless.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Keith Busch
     
  • Although NVM Express specification 1.3 requires a controller claiming to
    be 1.3 or higher to implement Identify CNS 03h (Namespace Identification
    Descriptor list), the driver doesn't really need this identification in
    order to use a namespace. The code comments already documented that an
    error from this command is not to be treated as a failure.

    Return success if the controller provided any response to a
    namespace identification descriptors command.
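
    The distinction this fix relies on can be sketched as a small Python
    model. It is not the driver code: it assumes the usual kernel convention
    that a negative return value is a host-side errno (no response received)
    and a positive value is an error status reported by the controller. The
    function name is hypothetical.

```python
def report_ns_ids_status(ret: int) -> int:
    """Map the result of a namespace identification descriptors command.

    Assumed convention (illustrative model only):
      ret < 0  -> host-side failure (negative errno), no response received
      ret == 0 -> command succeeded
      ret > 0  -> controller responded, but with an error status
    """
    if ret > 0:
        # The controller answered. The descriptors are not needed in order
        # to use the namespace, so do not fail namespace setup over this.
        return 0
    # Success (0) and host-side errors (negative) pass through unchanged.
    return ret
```

    Under this model, only failures where no response arrived at all are
    still propagated to the caller.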

    Fixes: 538af88ea7d9de24 ("nvme: make nvme_report_ns_ids propagate error back")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=205679
    Reported-by: Ingo Brunberg
    Cc: Sagi Grimberg
    Cc: stable@vger.kernel.org # 5.4+
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Keith Busch
     

02 Dec, 2019

1 commit

  • Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
    "As part of the cleanup of some remaining y2038 issues, I came to
    fs/compat_ioctl.c, which still has a couple of commands that need
    support for time64_t.

    In completely unrelated work, I spent time on cleaning up parts of
    this file in the past, moving things out into drivers instead.

    After Al Viro reviewed an earlier version of this series and did a lot
    more of that cleanup, I decided to try to completely eliminate the
    rest of it and move it all into drivers.

    This series incorporates some of Al's work and many patches of my own,
    but in the end stops short of actually removing the last part, which
    is the scsi ioctl handlers. I have patches for those as well, but they
    need more testing or possibly a rewrite"

    * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
    scsi: sd: enable compat ioctls for sed-opal
    pktcdvd: add compat_ioctl handler
    compat_ioctl: move SG_GET_REQUEST_TABLE handling
    compat_ioctl: ppp: move simple commands into ppp_generic.c
    compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
    compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
    compat_ioctl: unify copy-in of ppp filters
    tty: handle compat PPP ioctls
    compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
    compat_ioctl: handle SIOCOUTQNSD
    af_unix: add compat_ioctl support
    compat_ioctl: reimplement SG_IO handling
    compat_ioctl: move WDIOC handling into wdt drivers
    fs: compat_ioctl: move FITRIM emulation into file systems
    gfs2: add compat_ioctl support
    compat_ioctl: remove unused convert_in_user macro
    compat_ioctl: remove last RAID handling code
    compat_ioctl: remove /dev/raw ioctl translation
    compat_ioctl: remove PCI ioctl translation
    compat_ioctl: remove joystick ioctl translation
    ...

    Linus Torvalds
     

27 Nov, 2019

6 commits

  • Since e045fa29e893 ("PCI/MSI: Fix incorrect MSI-X masking on resume") is
    merged, we can revert the previous quirk now.

    This reverts commit 19ea025e1d28c629b369c3532a85b3df478cc5c6.

    Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887
    Fixes: 19ea025e1d28 ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T")
    Link: https://lore.kernel.org/r/20191031093408.9322-1-jian-hong@endlessm.com
    Signed-off-by: Jian-Hong Pan
    Signed-off-by: Bjorn Helgaas
    Acked-by: Christoph Hellwig
    Cc: stable@vger.kernel.org

    Jian-Hong Pan
     
  • If an error occurs on one of the ios used for creating an
    association, the creating routine has error paths that are
    invoked by the command failure and the error paths will free
    up the controller resources created to that point.

    However, the io failure was ultimately detected by an asynchronous
    completion routine, which unconditionally invokes the error_recovery
    path, which in turn calls delete_association. Delete association
    deletes all outstanding io and then tears down the controller
    resources. So the create_association thread can be running in
    parallel with the error_recovery thread. What was seen was that the
    LLDD received a call to delete a queue, causing the LLDD to free a
    resource, and then the transport called delete queue again, causing
    the driver to repeat the free call. The second free corrupted the
    allocator. The transport shouldn't be making the duplicate call, and
    the delete queue is just one of the resources being freed.

    The fix relies on the fact that the create_association path is
    completely serialized, with one command at a time. So the failed io
    completion will always be seen by the create_association path, and
    at the time of the failure there are no ios to terminate and there
    is no reason to be manipulating queue freeze states, etc. The
    serialized condition stays true until the controller is transitioned
    to the LIVE state. Thus the fix is to change the error recovery path
    to check the controller state and only invoke the teardown path if
    not already in the CONNECTING state.
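
    The state check described above can be modelled in a few lines of
    Python. This is a sketch under stated assumptions, not the nvme-fc
    code: the class and function names are hypothetical, and only the two
    controller states named in this message are modelled.

```python
from enum import Enum, auto


class CtrlState(Enum):
    # Only the two states mentioned in the commit message are modelled.
    CONNECTING = auto()
    LIVE = auto()


class SketchController:
    def __init__(self, state: CtrlState) -> None:
        self.state = state
        self.teardown_calls = 0

    def teardown_association(self) -> None:
        # Stand-in for terminating outstanding io and freeing queues.
        self.teardown_calls += 1


def error_recovery(ctrl: SketchController) -> bool:
    """Return True if the recovery path invoked the teardown."""
    if ctrl.state is CtrlState.CONNECTING:
        # create_association is serialized and will see the failed io
        # itself; its own error paths free what it created, so tearing
        # down here as well would free the same resources twice.
        return False
    ctrl.teardown_association()
    return True
```

    With a controller in CONNECTING, error_recovery() leaves the cleanup
    to the creating thread; with a LIVE controller it tears down once.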

    Reviewed-by: Himanshu Madhani
    Reviewed-by: Ewan D. Milne
    Signed-off-by: James Smart
    Signed-off-by: Keith Busch

    James Smart
     
  • Remove unnecessary keyword in nvme_create_queue().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Edmund Nadolski
    Signed-off-by: Keith Busch

    Edmund Nadolski
     
  • We've seen a few devices that return different controller ids to
    the Fabric Connect command vs the Identify(controller) command. It's
    currently hard to identify this failure from existing error messages. It
    comes across as a (re)connect attempt in the transport that fails with
    a -22 (-EINVAL) status. The issue is compounded by older kernels that
    either lacked the controller id check or had the identify command
    overwrite the fabrics controller id value before it was checked. Both
    resulted in cases where the devices appeared fine until more recent
    kernels.

    Clarify the reject by adding an error message on controller id mismatches.
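
    As an illustration of the check, here is a minimal Python model. The
    function name and the message wording are assumptions made for this
    sketch and are not taken from the patch; only the -EINVAL result is
    from the description above.

```python
import errno
import logging

log = logging.getLogger("nvme-sketch")


def check_cntlid(connect_cntlid: int, identify_cntlid: int) -> int:
    """Compare the controller id from Connect with the one from Identify.

    Returns 0 on a match, or -EINVAL (-22) on a mismatch after logging an
    explicit message so the reject is easy to diagnose.
    """
    if connect_cntlid != identify_cntlid:
        # Hypothetical wording; the point is that both ids are reported.
        log.error(
            "controller id mismatch: Connect returned %d, Identify returned %d",
            connect_cntlid,
            identify_cntlid,
        )
        return -errno.EINVAL
    return 0
```

    Logging both values turns an opaque -22 on (re)connect into a message
    that points directly at the misbehaving device.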

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ewan D. Milne
    Signed-off-by: James Smart
    Signed-off-by: Keith Busch

    James Smart
     
  • In nvme-fc, it's possible to have connected active controllers
    and, as no references are taken on the LLDD, the LLDD can be
    unloaded. The controller would enter a reconnect state and as
    long as the LLDD resumed within the reconnect timeout, the
    controller would resume. But if a namespace on the controller
    is the root device, allowing the driver to unload can be problematic.
    Reloading the driver may require new io to the boot device,
    and as it's no longer connected we get into a catch-22 that
    eventually fails, and the system locks up.

    Fix this issue by taking a module reference for every connected
    controller (which is what the core layer did to the transport
    module). Reference is cleared when the controller is removed.
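
    The reference-per-controller idea can be sketched as a toy Python
    model. All names here are hypothetical and the counter is a plain
    integer; it only illustrates why the LLDD cannot unload while any
    controller is still connected.

```python
class SketchModule:
    """Toy stand-in for a loadable driver module with a reference count."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.refcount = 0

    def can_unload(self) -> bool:
        # Unloading is only permitted once nobody holds a reference.
        return self.refcount == 0


class FcController:
    def __init__(self, lldd: SketchModule) -> None:
        self.lldd = lldd
        self.removed = False
        # Take one module reference for every connected controller.
        lldd.refcount += 1

    def remove(self) -> None:
        # Drop the reference exactly once, when the controller is removed.
        if not self.removed:
            self.removed = True
            self.lldd.refcount -= 1
```

    While at least one FcController exists and has not been removed,
    can_unload() stays False, which is the property the fix needs for a
    root device.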

    Acked-by: Himanshu Madhani
    Reviewed-by: Christoph Hellwig
    Signed-off-by: James Smart
    Signed-off-by: Keith Busch

    James Smart
     
  • nvme_loop_create_io_queues() preallocates a big buffer for the IO SGL based
    on SG_CHUNK_SIZE.

    Modern DMA engines are often capable of dealing with very big segments so
    the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
    SGL allocation per command.

    If a controller has lots of deep queues, preallocation for the sg list can
    consume substantial amounts of memory. For nvmet-loop, nr_hw_queues can be
    128 and each queue's depth 128. This means the resulting preallocation
    for the data SGL is 128*128*4K = 64MB per controller.

    Switch to runtime allocation for SGL for lists longer than 2 entries. This
    is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
    well. Runtime SGL allocation has always been the case for the legacy I/O
    path so this is nothing new.
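
    The 64MB figure above follows directly from the numbers given. A
    short Python check, with a hypothetical helper name, using the queue
    count, queue depth and 4KB per-command SGL size quoted in this
    message:

```python
def sgl_prealloc_bytes(nr_hw_queues: int, queue_depth: int,
                       sgl_bytes_per_cmd: int) -> int:
    """Total bytes preallocated when every command gets a static SGL."""
    return nr_hw_queues * queue_depth * sgl_bytes_per_cmd


# 128 queues, each 128 deep, with a 4KB SGL per command.
total = sgl_prealloc_bytes(128, 128, 4 * 1024)
print(total // (1024 * 1024), "MB per controller")  # prints: 64 MB per controller
```

    Allocating the SGL at runtime only for lists longer than 2 entries
    removes this fixed per-controller cost.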

    Tested-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Israel Rukshin
    Signed-off-by: Keith Busch

    Israel Rukshin