23 Jul, 2019

4 commits

  • This reverts commit 0298d5435276e7795b0b939d74827f6e775e7009.

    With this patch, setting 'poll_queues > hard queues' leads to 'nr_read_queues = 0'
    in nvme_calc_irq_sets. The poll_queues setting can then fail, since
    dev->tagset.nr_maps equals 2 and nvme_pci_map_queues will not map the poll queues.

    Signed-off-by: yangerkun
    Signed-off-by: Christoph Hellwig

    yangerkun
     
  • Fix a crash with multipath activated. It happens when the ANA log
    page is larger than MDTS and ANA is disabled because of that.
    The driver then tries to access an unallocated buffer when connecting
    to an nvme target. The signature is as follows:

    [ 300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
    [ 300.435387] nvme nvme0: disabling ANA support.
    [ 300.437835] nvme nvme0: creating 4 I/O queues.
    [ 300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
    [ 300.464609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    [ 300.466342] #PF error: [normal kernel read fault]
    [ 300.467385] PGD 0 P4D 0
    [ 300.467987] Oops: 0000 [#1] SMP PTI
    [ 300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
    [ 300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    [ 300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
    [ 300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
    [ 300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
    [ 300.477374] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296
    [ 300.478334] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000
    [ 300.479784] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258
    [ 300.481488] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
    [ 300.483203] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0
    [ 300.484928] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0
    [ 300.486626] FS: 0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000
    [ 300.488538] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 300.489907] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0
    [ 300.491612] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 300.493303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 300.494991] Call Trace:
    [ 300.495645] nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
    [ 300.496880] nvme_validate_ns+0x2ef/0x550 [nvme_core]
    [ 300.498105] ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
    [ 300.499539] nvme_scan_work+0x2b4/0x370 [nvme_core]
    [ 300.500717] ? __switch_to_asm+0x35/0x70
    [ 300.501663] process_one_work+0x171/0x380
    [ 300.502340] worker_thread+0x49/0x3f0
    [ 300.503079] kthread+0xf8/0x130
    [ 300.503795] ? max_active_store+0x80/0x80
    [ 300.504690] ? kthread_bind+0x10/0x10
    [ 300.505502] ret_from_fork+0x35/0x40
    [ 300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net virtio_console net_failover virtio_blk failover ata_piix serio_raw libata virtio_pci virtio_ring virtio
    [ 300.514984] CR2: 0000000000000008
    [ 300.515569] ---[ end trace faa2eefad7e7f218 ]---
    [ 300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
    [ 300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
    [ 300.520353] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296
    [ 300.521229] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000
    [ 300.522399] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258
    [ 300.523560] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
    [ 300.524734] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0
    [ 300.525915] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0
    [ 300.527084] FS: 0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000
    [ 300.528396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 300.529440] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0
    [ 300.530739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 300.531989] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 300.533264] Kernel panic - not syncing: Fatal exception
    [ 300.534338] Kernel Offset: 0x17c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [ 300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]---

    Condition check refactoring from Christoph Hellwig.

    Signed-off-by: Marta Rybczynska
    Tested-by: Jean-Baptiste Riaux
    Signed-off-by: Christoph Hellwig

    Marta Rybczynska
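
The failure mode described above can be illustrated with a small userspace model. This is only a sketch: `model_ana_ctrl`, `model_mpath_init`, `model_ctrl_use_ana` and `model_mpath_add_disk` are hypothetical names, not the driver's code, and the sketch assumes the fix amounts to treating a missing ANA log buffer as "ANA disabled" before anything reads from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; not the kernel's struct nvme_ctrl. */
struct model_ana_ctrl {
    size_t max_transfer_bytes; /* stands in for the MDTS limit */
    void  *ana_log_buf;        /* stays NULL when ANA is disabled */
};

/* Allocate the ANA log buffer only if the log fits within the transfer limit. */
bool model_mpath_init(struct model_ana_ctrl *ctrl, size_t ana_log_size)
{
    assert(ctrl != NULL);
    ctrl->ana_log_buf = NULL;
    if (ana_log_size > ctrl->max_transfer_bytes)
        return false; /* corresponds to "disabling ANA support" */
    ctrl->ana_log_buf = calloc(1, ana_log_size);
    return ctrl->ana_log_buf != NULL;
}

/* ANA is usable only if the buffer actually exists. */
bool model_ctrl_use_ana(const struct model_ana_ctrl *ctrl)
{
    return ctrl->ana_log_buf != NULL;
}

/* Returns 1 if the log was read, 0 if the ANA path was skipped. */
int model_mpath_add_disk(const struct model_ana_ctrl *ctrl)
{
    if (!model_ctrl_use_ana(ctrl))
        return 0; /* never touch the unallocated buffer */
    const unsigned char *log = ctrl->ana_log_buf;
    (void)log[0]; /* safe: the buffer is known to exist */
    return 1;
}
```

With an 8192-byte limit and an 8208-byte log, as in the trace above, `model_mpath_init()` leaves the buffer NULL and `model_mpath_add_disk()` skips the ANA path instead of dereferencing it.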
     
  • When freeing the subsystem after finding another match with
    __nvme_find_get_subsystem(), use put_device() instead of
    __nvme_release_subsystem() which calls kfree() directly.

    Per the documentation, put_device() should always be used
    after device_initialize() is called. Otherwise, leaks
    like the one below, which was detected by kmemleak, may occur.

    Once the call of __nvme_release_subsystem() is removed it no
    longer makes sense to keep the helper, so fold it back
    into nvme_release_subsystem().

    unreferenced object 0xffff8883d12bfbc0 (size 16):
    comm "nvme", pid 2635, jiffies 4294933602 (age 739.952s)
    hex dump (first 16 bytes):
    6e 76 6d 65 2d 73 75 62 73 79 73 32 00 88 ff ff nvme-subsys2....
    backtrace:
    [] __kmalloc_track_caller+0x16d/0x2a0
    [] kvasprintf+0xad/0x130
    [] kvasprintf_const+0x47/0x120
    [] kobject_set_name_vargs+0x44/0x120
    [] dev_set_name+0x98/0xc0
    [] nvme_init_identify+0x1995/0x38e0
    [] nvme_loop_configure_admin_queue+0x4fa/0x5e0
    [] nvme_loop_create_ctrl+0x489/0xf80
    [] nvmf_dev_write+0x1a12/0x2220
    [] __vfs_write+0x66/0x120
    [] vfs_write+0x154/0x490
    [] ksys_write+0x10a/0x240
    [] __x64_sys_write+0x73/0xb0
    [] do_syscall_64+0xaa/0x470
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fixes: ab9e00cc72fa ("nvme: track subsystems")
    Signed-off-by: Logan Gunthorpe
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Logan Gunthorpe
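
Why a bare free leaks here can be shown with a userspace model of a reference-counted object that owns a separately allocated name. This is a sketch under stated assumptions: `model_device`, `model_put_device` and the allocation counter are hypothetical and only mimic the general shape of the driver model, not its implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a refcounted device; not the kernel's struct device. */
struct model_device {
    int   refcount;
    char *name;                             /* separately allocated, like a kobject name */
    void (*release)(struct model_device *); /* frees the containing object only */
};

int model_live_allocations; /* outstanding allocations, for demonstration */

void *model_alloc(size_t n)
{
    void *p = malloc(n);
    if (p)
        model_live_allocations++;
    return p;
}

void model_free(void *p)
{
    if (p) {
        model_live_allocations--;
        free(p);
    }
}

/* Release callback: frees the structure but knows nothing about the name. */
void model_subsys_release(struct model_device *dev)
{
    model_free(dev);
}

struct model_device *model_device_create(const char *name)
{
    struct model_device *dev = model_alloc(sizeof(*dev));
    if (!dev)
        return NULL;
    dev->refcount = 1;
    dev->release = model_subsys_release;
    dev->name = model_alloc(strlen(name) + 1);
    if (!dev->name) {
        model_free(dev);
        return NULL;
    }
    strcpy(dev->name, name);
    return dev;
}

/* Drop one reference; the last put frees the name, then calls release. */
void model_put_device(struct model_device *dev)
{
    assert(dev != NULL && dev->refcount > 0);
    if (--dev->refcount == 0) {
        model_free(dev->name);
        dev->release(dev);
    }
}
```

Calling `model_subsys_release()` directly, as the old code effectively did with kfree(), leaves the name allocation behind, which matches the 16-byte "nvme-subsys2" object in the kmemleak report. Going through `model_put_device()` frees both.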
     
  • The ADATA SX6000LNP NVMe SSDs have the same subnqn and, due to this, a
    system with more than one of these SSDs will only have one usable.

    [ 0.942706] nvme nvme1: ignoring ctrl due to duplicate subnqn (nqn.2018-05.com.example:nvme:nvm-subsystem-OUI00E04C).
    [ 0.943017] nvme nvme1: Removing after probe failure status: -22

    02:00.0 Non-Volatile memory controller [0108]: Realtek Semiconductor Co., Ltd. Device [10ec:5762] (rev 01)
    71:00.0 Non-Volatile memory controller [0108]: Realtek Semiconductor Co., Ltd. Device [10ec:5762] (rev 01)

    There are no firmware updates available from the vendor, unfortunately.
    Applying the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk for these SSDs resolves
    the issue, and they all work after this patch:

    /dev/nvme0n1 2J1120050420 ADATA SX6000LNP [...]
    /dev/nvme1n1 2J1120050540 ADATA SX6000LNP [...]

    Signed-off-by: Misha Nasledov
    Signed-off-by: Christoph Hellwig

    Misha Nasledov
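
A quirk of this kind is typically keyed on the PCI vendor and device ID. The sketch below shows the idea as a plain lookup table. The IDs 10ec:5762 come from the lspci output above; the table layout, the bit value and the names `model_quirk_entry` and `model_lookup_quirks` are assumptions made for illustration, not the driver's actual table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value; only the quirk's purpose is taken from the text. */
#define MODEL_QUIRK_IGNORE_DEV_SUBNQN (1u << 0)

struct model_quirk_entry {
    uint16_t     vendor;
    uint16_t     device;
    unsigned int quirks;
};

/* One entry, using the IDs shown by lspci for the affected SSDs. */
const struct model_quirk_entry model_quirk_table[] = {
    { 0x10ec, 0x5762, MODEL_QUIRK_IGNORE_DEV_SUBNQN },
};

/* Return the quirk bits for a device, or 0 if it has no entry. */
unsigned int model_lookup_quirks(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof(model_quirk_table) / sizeof(model_quirk_table[0]);
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        if (model_quirk_table[i].vendor == vendor &&
            model_quirk_table[i].device == device)
            return model_quirk_table[i].quirks;
    }
    return 0;
}
```

A caller would test the returned bits and, when the quirk is set, ignore the subnqn reported by the device rather than rejecting the second controller as a duplicate.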
     

16 Jul, 2019

2 commits

  • Pull more block updates from Jens Axboe:
    "A later pull request with some followup items. I had some vacation
    coming up to the merge window, so certain items were delayed a
    bit. This pull request also contains fixes that came in within the
    last few days of the merge window, which I didn't want to push right
    before sending you a pull request.

    This contains:

    - NVMe pull request, mostly fixes, but also a few minor items on the
    feature side that were timing constrained (Christoph et al)

    - Report zones fixes (Damien)

    - Removal of dead code (Damien)

    - Turn on cgroup psi memstall (Josef)

    - block cgroup MAINTAINERS entry (Konstantin)

    - Flush init fix (Josef)

    - blk-throttle low iops timing fix (Konstantin)

    - nbd resize fixes (Mike)

    - nbd 0 blocksize crash fix (Xiubo)

    - block integrity error leak fix (Wenwen)

    - blk-cgroup writeback and priority inheritance fixes (Tejun)"

    * tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
    MAINTAINERS: add entry for block io cgroup
    null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
    block: Limit zone array allocation size
    sd_zbc: Fix report zones buffer allocation
    block: Kill gfp_t argument of blkdev_report_zones()
    block: Allow mapping of vmalloc-ed buffers
    block/bio-integrity: fix a memory leak bug
    nvme: fix NULL deref for fabrics options
    nbd: add netlink reconfigure resize support
    nbd: fix crash when the blksize is zero
    block: Disable write plugging for zoned block devices
    block: Fix elevator name declaration
    block: Remove unused definitions
    nvme: fix regression upon hot device removal and insertion
    blk-throttle: fix zero wait time for iops throttled group
    block: Fix potential overflow in blk_report_zones()
    blkcg: implement REQ_CGROUP_PUNT
    blkcg, writeback: Implement wbc_blkcg_css()
    blkcg, writeback: Add wbc->no_cgroup_owner
    blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
    ...

    Linus Torvalds
     
  • Pull rdma updates from Jason Gunthorpe:
    "A smaller cycle this time. Notably we see another new driver, 'Soft
    iWarp', and the deletion of an ancient unused driver for nes.

    - Revise and simplify the signature offload RDMA MR APIs

    - More progress on hoisting object allocation boiler plate code out
    of the drivers

    - Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib,
    i40iw

    - Tree wide cleanups: struct_size, put_user_page, xarray, rst doc
    conversion

    - Removal of obsolete ib_ucm chardev and nes driver

    - netlink based discovery of chardevs and autoloading of the modules
    providing them

    - Move more of the rdmavt/hfi1 uapi to include/uapi/rdma

    - New driver 'siw' for software based iWarp running on top of netdev,
    much like rxe's software RoCE.

    - mlx5 feature to report events in their raw devx format to userspace

    - Expose per-object counters through rdma tool

    - Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core
    from netdev"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (194 commits)
    RMDA/siw: Require a 64 bit arch
    RDMA/siw: Mark expected switch fall-throughs
    RDMA/core: Fix -Wunused-const-variable warnings
    rdma/siw: Remove set but not used variable 's'
    rdma/siw: Add missing dependencies on LIBCRC32C and DMA_VIRT_OPS
    RDMA/siw: Add missing rtnl_lock around access to ifa
    rdma/siw: Use proper enumerated type in map_cqe_status
    RDMA/siw: Remove unnecessary kthread create/destroy printouts
    IB/rdmavt: Fix variable shadowing issue in rvt_create_cq
    RDMA/core: Fix race when resolving IP address
    RDMA/core: Make rdma_counter.h compile stand alone
    IB/core: Work on the caller socket net namespace in nldev_newlink()
    RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
    RDMA/mlx5: Set RDMA DIM to be enabled by default
    RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink
    RDMA/core: Provide RDMA DIM support for ULPs
    linux/dim: Implement RDMA adaptive moderation (DIM)
    IB/mlx5: Report correctly tag matching rendezvous capability
    docs: infiniband: add it to the driver-api bookset
    IB/mlx5: Implement VHCA tunnel mechanism in DEVX
    ...

    Linus Torvalds
     

12 Jul, 2019

2 commits

  • Pull SCSI scatter-gather list updates from James Bottomley:
    "This topic branch covers a fundamental change in how our sg lists are
    allocated to make mq more efficient by reducing the size of the
    preallocated sg list.

    This necessitates a large number of driver changes because the
    previous guarantee that if a driver specified SG_ALL as the size of
    its scatter list, it would get a non-chained list and didn't need to
    bother with scatterlist iterators is now broken and every driver
    *must* use scatterlist iterators.

    This was broken out as a separate topic because we need to convert all
    the drivers before pulling the trigger and unconverted drivers kept
    being found, necessitating a rebase"

    * tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits)
    scsi: core: don't preallocate small SGL in case of NO_SG_CHAIN
    scsi: lib/sg_pool.c: clear 'first_chunk' in case of no preallocation
    scsi: core: avoid preallocating big SGL for data
    scsi: core: avoid preallocating big SGL for protection information
    scsi: lib/sg_pool.c: improve APIs for allocating sg pool
    scsi: esp: use sg helper to iterate over scatterlist
    scsi: NCR5380: use sg helper to iterate over scatterlist
    scsi: wd33c93: use sg helper to iterate over scatterlist
    scsi: ppa: use sg helper to iterate over scatterlist
    scsi: pcmcia: nsp_cs: use sg helper to iterate over scatterlist
    scsi: imm: use sg helper to iterate over scatterlist
    scsi: aha152x: use sg helper to iterate over scatterlist
    scsi: s390: zfcp_fc: use sg helper to iterate over scatterlist
    scsi: staging: unisys: visorhba: use sg helper to iterate over scatterlist
    scsi: usb: image: microtek: use sg helper to iterate over scatterlist
    scsi: pmcraid: use sg helper to iterate over scatterlist
    scsi: ipr: use sg helper to iterate over scatterlist
    scsi: mvumi: use sg helper to iterate over scatterlist
    scsi: lpfc: use sg helper to iterate over scatterlist
    scsi: advansys: use sg helper to iterate over scatterlist
    ...

    Linus Torvalds
     
  • git://git.infradead.org/nvme.git nvme-5.3 branch now causes the
    following NULL deref oops. Check the ctrl->opts first before the deref.

    [ 16.337581] BUG: kernel NULL pointer dereference, address: 0000000000000056
    [ 16.338551] #PF: supervisor read access in kernel mode
    [ 16.338551] #PF: error_code(0x0000) - not-present page
    [ 16.338551] PGD 0 P4D 0
    [ 16.338551] Oops: 0000 [#1] SMP PTI
    [ 16.338551] CPU: 2 PID: 1035 Comm: kworker/u16:5 Not tainted 5.2.0-rc6+ #1
    [ 16.338551] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
    [ 16.338551] Workqueue: nvme-wq nvme_scan_work [nvme_core]
    [ 16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core]
    [ 16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf
    [ 16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283
    [ 16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007
    [ 16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720
    [ 16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840
    [ 16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8
    [ 16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001
    [ 16.338551] FS: 0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000
    [ 16.338551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0
    [ 16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 16.338551] Call Trace:
    [ 16.338551] nvme_scan_work+0x2c0/0x340 [nvme_core]
    [ 16.338551] ? __switch_to_asm+0x40/0x70
    [ 16.338551] ? _raw_spin_unlock_irqrestore+0x18/0x30
    [ 16.338551] ? try_to_wake_up+0x408/0x450
    [ 16.338551] process_one_work+0x20b/0x3e0
    [ 16.338551] worker_thread+0x1f9/0x3d0
    [ 16.338551] ? cancel_delayed_work+0xa0/0xa0
    [ 16.338551] kthread+0x117/0x120
    [ 16.338551] ? kthread_stop+0xf0/0xf0
    [ 16.338551] ret_from_fork+0x3a/0x50
    [ 16.338551] Modules linked in: nvme nvme_core
    [ 16.338551] CR2: 0000000000000056
    [ 16.338551] ---[ end trace b9bf761a93e62d84 ]---
    [ 16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core]
    [ 16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf
    [ 16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283
    [ 16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007
    [ 16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720
    [ 16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840
    [ 16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8
    [ 16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001
    [ 16.338551] FS: 0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000
    [ 16.338551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0
    [ 16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 958f2a0f8121 ("nvme-tcp: set the STABLE_WRITES flag when data digests are enabled")
    Cc: Christoph Hellwig
    Cc: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Minwoo Im
    Signed-off-by: Jens Axboe

    Minwoo Im
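
The fix described above is a guard on an optional pointer. A minimal sketch follows, assuming that the options structure is simply absent (NULL) for controllers that were not set up through fabrics. The types `model_fabrics_opts` and `model_opts_ctrl` and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fabrics options and the controller. */
struct model_fabrics_opts {
    bool data_digest;
};

struct model_opts_ctrl {
    const struct model_fabrics_opts *opts; /* assumed NULL for non-fabrics controllers */
};

/* Decide whether stable writes are needed, checking opts before the deref. */
bool model_needs_stable_writes(const struct model_opts_ctrl *ctrl)
{
    assert(ctrl != NULL);
    return ctrl->opts != NULL && ctrl->opts->data_digest;
}
```

Without the `ctrl->opts != NULL` test, a controller with no options would read a field at a small offset from address zero, which is consistent with the faulting address 0x56 in the oops.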
     

11 Jul, 2019

1 commit

  • When we validate the new controller id, we want to skip
    controllers that are either deleting or dead. Fix the check
    to do that and not on the newly added controller.

    Fixes: 1b1031ca63b2 ("nvme: validate cntlid during controller initialisation")
    Reported-by: Jon Derrick
    Tested-by: Jon Derrick
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
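
The corrected check can be sketched as a loop that tests the state of each existing controller, not the state of the one being added. The enum, struct and function names below are hypothetical, and the sketch assumes a flat array in place of the subsystem's controller list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_ctrl_state { MODEL_LIVE, MODEL_DELETING, MODEL_DEAD };

/* Hypothetical controller record. */
struct model_id_ctrl {
    uint16_t              cntlid;
    enum model_ctrl_state state;
};

/* True if no usable existing controller already has the candidate's cntlid. */
bool model_cntlid_is_unique(const struct model_id_ctrl *existing, size_t n,
                            const struct model_id_ctrl *candidate)
{
    assert(candidate != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct model_id_ctrl *tmp = &existing[i];

        /* Skip based on the existing entry's state, not the candidate's. */
        if (tmp->state == MODEL_DELETING || tmp->state == MODEL_DEAD)
            continue;
        if (tmp->cntlid == candidate->cntlid)
            return false;
    }
    return true;
}
```

With this shape, a controller that is being torn down no longer blocks a new controller from reusing its id.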
     

10 Jul, 2019

18 commits

  • Current code allows the module to be unloaded even if there are
    pending data structures, such as localports and controllers on
    the localports, that have yet to hit their reference counting
    to remove them.

    Fix by having exit entrypoint explicitly delete every controller,
    which in turn will remove references on the remoteports and localports
    causing them to be deleted as well. The exit entrypoint, after
    initiating the deletes, will wait for the last localport to be deleted
    before continuing.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • According to commit a10674bf2406 ("tcp: detecting the misuse of
    .sendpage for Slab objects") and previous discussion, tcp_sendpage
    should not be used for pages that are managed by SLAB, as SLAB does not
    take page reference counters into consideration.

    Signed-off-by: Mikhail Skorzhinskii
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Mikhail Skorzhinskii
     
  • A few false alarms about wrong data digests were sighted on the target
    side while performing a high-throughput load to an XFS filesystem shared
    through NVMoF TCP.

    The STABLE_WRITES flag tells the rest of the kernel to ensure that the
    data buffer does not change while the write is in flight. It incurs a
    performance penalty, so only enable it when it is actually needed, i.e.
    when we are calculating data digests.

    Even with this change in place, ext2 users can still experience
    false positives, as ext2 does not respect this flag. This may apply
    to vfat as well.

    Signed-off-by: Mikhail Skorzhinskii
    Signed-off-by: Mike Playle
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Mikhail Skorzhinskii
     
  • Add this hint for the sake of convenience.

    People have repeatedly spent time working out what exactly was wrong
    in the configuration process. This should save some time in such
    situations, especially for people who are not very familiar with NVMe
    requirements.

    Signed-off-by: Mikhail Skorzhinskii
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Mikhail Skorzhinskii
     
  • nvme_ns_remove() first sets the NVME_NS_REMOVING flag and only removes
    the namespace from the list at the very last step.
    So to avoid selecting a namespace in nvme_find_path() that is about to be
    removed, check the NVME_NS_REMOVING flag, too, when selecting a new path.

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Hannes Reinecke
     
  • When we have a singular list in nvme_round_robin_path() we still
    need to check its validity.

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Hannes Reinecke
     
  • Factor out a common helper to check if a path has been disabled
    by something other than the per-namespace ANA state.

    Signed-off-by: Hannes Reinecke
    [hch: split from a bigger patch]
    Signed-off-by: Christoph Hellwig

    Hannes Reinecke
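
The three multipath changes above share one idea: a path must pass the same "is it disabled?" test however it was found, including when it is the only entry in the list. A rough sketch follows; the flag bits, `model_path` and both functions are hypothetical, and the exact set of conditions in the driver's helper is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration. */
#define MODEL_NS_REMOVING    (1u << 0)
#define MODEL_NS_ANA_PENDING (1u << 1)

struct model_path {
    bool         ctrl_live; /* controller is in a usable state */
    unsigned int flags;
};

/* Common helper: disabled by something other than the per-namespace ANA state. */
bool model_path_is_disabled(const struct model_path *p)
{
    assert(p != NULL);
    return !p->ctrl_live ||
           (p->flags & (MODEL_NS_REMOVING | MODEL_NS_ANA_PENDING)) != 0;
}

/* Return the index of the first usable path, or -1 if there is none. */
int model_find_path(const struct model_path *paths, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!model_path_is_disabled(&paths[i]))
            return (int)i;
    }
    return -1;
}
```

Because the check runs on every candidate, a single-entry list whose only path is being removed yields -1 instead of handing back a namespace that is about to disappear.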
     
  • From the NVMe 1.4 spec:

    NSFEAT bit 4 if set to 1: indicates that the fields NPWG, NPWA, NPDG, NPDA,
    and NOWS are defined for this namespace and should be used by the host for
    I/O optimization;
    [ ... ]
    Namespace Preferred Write Granularity (NPWG): This field indicates the
    smallest recommended write granularity in logical blocks for this namespace.
    This is a 0's based value. The size indicated should be less than or equal
    to Maximum Data Transfer Size (MDTS) that is specified in units of minimum
    memory page size. The value of this field may change if the namespace is
    reformatted. The size should be a multiple of Namespace Preferred Write
    Alignment (NPWA). Refer to section 8.25 for how this field is utilized to
    improve performance and endurance.
    [ ... ]
    Each Write, Write Uncorrectable, or Write Zeroes commands should address a
    multiple of Namespace Preferred Write Granularity (NPWG) (refer to Figure
    245) and Stream Write Size (SWS) (refer to Figure 515) logical blocks (as
    expressed in the NLB field), and the SLBA field of the command should be
    aligned to Namespace Preferred Write Alignment (NPWA) (refer to Figure 245)
    for best performance.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Bart Van Assche
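
Since NPWG is a 0's based value, a field value of n means n + 1 logical blocks. The arithmetic can be sketched as below. The function names are hypothetical, and falling back to a single logical block when NSFEAT bit 4 is clear is an assumption of this sketch, not something stated in the quoted text.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSFEAT_OPTPERF (1u << 4) /* NSFEAT bit 4, per the quoted text */

/* Convert a 0's based count of logical blocks into bytes. */
uint32_t model_zeroes_based_bytes(uint16_t field, uint32_t lba_bytes)
{
    return ((uint32_t)field + 1u) * lba_bytes;
}

/* Preferred write granularity in bytes for a namespace. */
uint32_t model_preferred_write_bytes(uint8_t nsfeat, uint16_t npwg,
                                     uint32_t lba_bytes)
{
    assert(lba_bytes != 0);
    if (!(nsfeat & MODEL_NSFEAT_OPTPERF))
        return lba_bytes; /* assumed fallback: fields are not defined */
    return model_zeroes_based_bytes(npwg, lba_bytes);
}
```

For example, NPWG = 7 with 512-byte logical blocks gives a preferred write granularity of 8 blocks, i.e. 4096 bytes.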
     
  • Make the NVMe NAWUN, NAWUPF, NACWU, NPWG, NPWA, NPDG and NOWS attributes
    available to initiator systems for the block backend.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Bart Van Assche
     
  • The trace log for the 'delete I/O submission queue' and 'delete I/O
    completion queue' commands will look like this:

    kworker/u49:1-3438 [003] .... 6693.070865: nvme_setup_cmd: nvme0: qid=0, cmdid=11, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_delete_sq sqid=1)
    kworker/u49:1-3438 [003] .... 6693.071171: nvme_setup_cmd: nvme0: qid=0, cmdid=8, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_delete_cq cqid=24)

    Signed-off-by: Tom Wu
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Minwoo Im
    Reviewed-by: Israel Rukshin
    Signed-off-by: Christoph Hellwig

    Tom Wu
     
  • There are two spelling mistakes in trace_seq_printf messages, fix these.

    Signed-off-by: Colin Ian King
    Reviewed-by: Minwoo Im
    Signed-off-by: Christoph Hellwig

    Colin Ian King
     
  • When running an NVMe device that is attached to an addressing-challenged
    PCIe root port that requires bounce buffering, our
    request sizes can easily overflow the swiotlb bounce buffer
    size. Limit the maximum I/O size to the limit exposed by
    the DMA mapping subsystem.

    Signed-off-by: Christoph Hellwig
    Reported-by: Atish Patra
    Tested-by: Atish Patra
    Reviewed-by: Sagi Grimberg

    Christoph Hellwig
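
The limit described above amounts to taking the smaller of the driver's own cap and the DMA mapping limit, both expressed in 512-byte sectors. A sketch of that arithmetic follows; `model_max_hw_sectors` and its parameters are hypothetical, and the numbers in the usage note are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Clamp the maximum I/O size, in 512-byte sectors, to what a single
 * DMA mapping can cover. driver_max_kb is the driver's own cap in KiB.
 */
uint32_t model_max_hw_sectors(uint32_t driver_max_kb, size_t dma_max_mapping_bytes)
{
    uint64_t driver_sectors = (uint64_t)driver_max_kb * 2; /* KiB to sectors */
    uint64_t dma_sectors = (uint64_t)dma_max_mapping_bytes >> 9; /* bytes to sectors */
    uint64_t limit = driver_sectors < dma_sectors ? driver_sectors : dma_sectors;

    assert(driver_max_kb != 0);
    return limit > UINT32_MAX ? UINT32_MAX : (uint32_t)limit;
}
```

With a driver cap of 4096 KiB and a bounce-buffer limit of 256 KiB, the result is 512 sectors; when the DMA layer reports no practical limit, the driver cap of 8192 sectors applies.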
     
  • Modify nvme_alloc_sq_cmds() to call pci_free_p2pmem() to free the memory
    it allocated using pci_alloc_p2pmem() in case pci_p2pmem_virt_to_bus()
    returns null.

    Make sure not to call pci_free_p2pmem() if pci_alloc_p2pmem() returned
    NULL, which can happen if CONFIG_PCI_P2PDMA is not configured.

    The current implementation is not expected to leak since
    pci_p2pmem_virt_to_bus() is expected to fail only if pci_alloc_p2pmem()
    returns null. However, checking the return value of pci_alloc_p2pmem()
    is more explicit.

    Signed-off-by: Alan Mikhak
    Signed-off-by: Christoph Hellwig

    Alan Mikhak
     
  • Only request an IRQ mapping for read queues if at least one read queue
    is being allocated. If nvme_dev_add() requests such an IRQ mapping even
    though no read queues are being allocated, nvme_pci_map_queues() will
    later ignore the unnecessary request. However, nvme_dev_add() can avoid
    making the request in the first place by checking the number of read
    queues instead of assuming there is one. This brings it more in line with
    nvme_setup_irqs() and nvme_calc_irq_sets().

    Signed-off-by: Alan Mikhak
    Signed-off-by: Christoph Hellwig

    Alan Mikhak
     
  • Since Linux 5.0 drivers can safely set the largest DMA mask supported
    by the device, and don't need fallbacks to work around the dma mapping
    implementations.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg

    Christoph Hellwig
     
  • Fix sparse warning:

    drivers/nvme/host/pci.c:2926:25: warning:
    symbol 'nvme_dev_pm_ops' was not declared. Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Reviewed-by: Minwoo Im
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    YueHaibing
     
  • With additional debugging enabled, warnings are seen for suspicious RCU
    usage or for a sleeping function called from invalid context.

    These both map to allocation of a work structure which is currently
    GFP_KERNEL, meaning it can sleep. For the RCU warning, the sequence was
    sleeping while holding the RCU lock.

    Convert the allocation to GFP_ATOMIC.

    Signed-off-by: James Smart
    Reviewed-by: Minwoo Im
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • With extra debugging on, inconsistent lock state warnings are called
    out because tfcp_req->reqlock is taken without irq, while some
    calling sequences run in a softirq state.

    Change the lock taking/release to raise/drop irq.

    Signed-off-by: James Smart
    Reviewed-by: Minwoo Im
    Signed-off-by: Christoph Hellwig

    James Smart
     

29 Jun, 2019

1 commit

  • For dependencies in next patches.

    Resolve conflicts:
    - Use uverbs_get_cleared_udata() with new cq allocation flow
    - Continue to delete nes despite SPDX conflict
    - Resolve list appends in mlx5_command_str()
    - Use u16 for vport_rule stuff
    - Resolve list appends in struct ib_client

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

24 Jun, 2019

1 commit


21 Jun, 2019

11 commits

  • This makes it possible to inject errors into the commands submitted to
    the admin queue.

    It is useful for testing error handling in the controller initialization.

    # echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability
    # echo 1 > /sys/kernel/debug/nvme0/fault_inject/times
    # echo 10 > /sys/kernel/debug/nvme0/fault_inject/space
    # nvme reset /dev/nvme0
    # dmesg
    ...
    nvme nvme0: Could not set queue count (16385)
    nvme nvme0: IO queues not created

    Signed-off-by: Akinobu Mita
    Reviewed-by: Minwoo Im
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christoph Hellwig

    Akinobu Mita
     
  • Currently, fault injection support for nvme only allows injecting errors
    into the commands submitted to I/O queues.

    In preparation for fault injection into the admin commands, this makes
    the helper functions independent of struct nvme_ns.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Minwoo Im
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Akinobu Mita
     
  • This patch introduces target-side request tracing. As Christoph
    suggested, the trace is not placed in a core or module, to avoid
    disadvantages like cache misses:
    http://lists.infradead.org/pipermail/linux-nvme/2019-June/024721.html

    The target-side trace code is entirely based on Johannes's trace code
    from the host side. It duplicates a lot of code, but that is better
    than having the disadvantages mentioned above.

    It traces not only fabrics commands but also normal nvme commands.
    Once the amount of code to be shared grows, we can make it common as
    suggested.

    This also removes the create_sq and create_cq trace parsing functions
    because that is covered by the connect fabrics command.

    Example:
    echo 1 > /sys/kernel/debug/tracing/events/nvmet/nvmet_req_init/enable
    echo 1 > /sys/kernel/debug/tracing/events/nvmet/nvmet_req_complete/enable
    cat /sys/kernel/debug/tracing/trace

    Signed-off-by: Minwoo Im
    [hch: fixed the symbol namespace and an endianness conversion]
    Signed-off-by: Christoph Hellwig

    Minwoo Im
     
  • The "result" field is 64 bits wide, so when printed in decimal it can
    look like:
    nvme_complete_rq: nvme0: qid=0, cmdid=0, res=18446612684158962624, retries=0, flags=0x0, status=0

    Switch both the result and status fields to be printed in hexadecimal
    format, which is easier to read.

    Signed-off-by: Minwoo Im
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christoph Hellwig

    Minwoo Im
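
The readability difference is easy to reproduce in userspace. The sketch below formats a 64-bit result and a 16-bit status in hexadecimal. `model_format_completion` is a hypothetical helper and its output layout is an assumption; it is not the trace event's actual format string.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format a completion's result and status in hexadecimal. */
int model_format_completion(char *buf, size_t len, uint64_t res, uint16_t status)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "res=0x%" PRIx64 ", status=0x%x",
                    res, (unsigned int)status);
}
```

Passing the decimal value from the example above, 18446612684158962624, produces a 16-digit hexadecimal string, which is much easier to recognise as a kernel pointer than the 20-digit decimal form.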
     
  • This patch introduces host-side tracing of fabrics commands.
    It does not change the existing host-side tracing; it just adds
    parsing of fabrics commands in cmd=() format.

    Signed-off-by: Minwoo Im
    [hch: fixed some whitespace damage]
    Signed-off-by: Christoph Hellwig

    Minwoo Im
     
  • The following patches are going to provide target-side tracing, which
    might need these kinds of macros. It would be great if they could be
    shared between the host and target sides.

    Signed-off-by: Minwoo Im
    Signed-off-by: Christoph Hellwig

    Minwoo Im
     
  • nvme_trace_disk_name() is now invoked through the function
    prototype in trace.h, so we don't need to export this symbol at all.

    The following patches are going to provide a target-side trace feature
    with exactly the same function, so this patch removes the
    EXPORT_SYMBOL() for this function.

    Signed-off-by: Minwoo Im
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Minwoo Im
     
  • Remove the status parameter of nvme_remove_dead_ctrl(), which is only
    used for printing it.

    Move the print message to the function where the actual error
    occurs.

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Chaitanya Kulkarni
     
  • If the state change to NVME_CTRL_CONNECTING fails, dmesg looks
    like:

    [ 293.689160] nvme nvme0: failed to mark controller CONNECTING
    [ 293.689160] nvme nvme0: Removing after probe failure status: 0

    The first line indicates what went wrong, but the second line is
    misleading because a status of 0 normally means that the previous
    operation succeeded.

    This patch makes it report the proper error value on failure:
    [ 25.932367] nvme nvme0: failed to mark controller CONNECTING
    [ 25.932369] nvme nvme0: Removing after probe failure status: -16
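
    On Linux, errno 16 is EBUSY, so the -16 in the second log line reads
    as -EBUSY. The following minimal sketch decodes such a kernel-style
    negative status; probe_status_name() is a hypothetical helper and
    only covers the values mentioned in this log.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Map a kernel-style status (0 or a negative errno) to a short name. */
const char *probe_status_name(int status)
{
    assert(status <= 0);
    if (status == 0)
        return "success";
    if (status == -EBUSY)
        return "EBUSY";
    if (status == -ENODEV)
        return "ENODEV";
    return "other";
}
```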

    This situation can be easily reproduced by:
    root@target:~# rmmod nvme && modprobe nvme && rmmod nvme

    Signed-off-by: Minwoo Im
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Minwoo Im
     
  • Remove the confusing assignment of the variable result at
    declaration and set the value in the error cases, next to the
    places where the actual errors occur.

    Also set result to -ENODEV when the final ctrl state transition in
    nvme_reset_work() fails. Without this assignment, result would still
    hold 0 from nvme_setup_io_queue(), and on failure of the final state
    transition 0 would be passed to nvme_remove_dead_ctrl().

    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Chaitanya Kulkarni
     
  • If "irq_queues" is greater than num_possible_cpus(),
    nvme_calc_irq_sets() can compute an irq set_size for
    HCTX_TYPE_DEFAULT that is larger than the CPUs can accommodate:
    2039 affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;

    This can trigger a WARN() in irq_build_affinity_masks(), as in [1]:
    220 if (nr_present < numvecs)
    221 WARN_ON(nr_present + nr_others < numvecs);

    This patch avoids the WARN() by adjusting the max_vector value in
    nvme_setup_irqs().
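
    The idea of the adjustment can be sketched as a simple cap on the
    number of requested vectors. This is an illustration under stated
    assumptions, not the code of the patch: clamp_irq_vectors() and
    default_set_size() are hypothetical helpers, and the extra vector
    reserved for the admin queue is an assumption of this sketch.

```c
#include <assert.h>

/*
 * Cap the number of requested interrupt vectors at one per CPU plus
 * one extra vector (assumed here to be reserved for the admin queue).
 */
unsigned int clamp_irq_vectors(unsigned int wanted, unsigned int nr_cpus)
{
    unsigned int max_vectors;

    assert(nr_cpus > 0);
    max_vectors = nr_cpus + 1;
    return wanted < max_vectors ? wanted : max_vectors;
}

/* Mirrors the quoted line: the default set gets what the read set leaves. */
unsigned int default_set_size(unsigned int nrirqs,
                              unsigned int nr_read_queues)
{
    assert(nr_read_queues <= nrirqs);
    return nrirqs - nr_read_queues;
}
```

    With the cap applied before the sets are calculated, the default set
    can no longer be asked to spread more vectors than there are CPUs.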

    [1] WARN messages when modprobe nvme write_queues=32 poll_queues=0:
    root@target:~/nvme# nproc
    8
    root@target:~/nvme# modprobe nvme write_queues=32 poll_queues=0
    [ 17.925326] nvme nvme0: pci function 0000:00:04.0
    [ 17.940601] WARNING: CPU: 3 PID: 1030 at kernel/irq/affinity.c:221 irq_create_affinity_masks+0x222/0x330
    [ 17.940602] Modules linked in: nvme nvme_core [last unloaded: nvme]
    [ 17.940605] CPU: 3 PID: 1030 Comm: kworker/u17:4 Tainted: G W 5.1.0+ #156
    [ 17.940605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
    [ 17.940608] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    [ 17.940609] RIP: 0010:irq_create_affinity_masks+0x222/0x330
    [ 17.940611] Code: 4c 8d 4c 24 28 4c 8d 44 24 30 e8 c9 fa ff ff 89 44 24 18 e8 c0 38 fa ff 8b 44 24 18 44 8b 54 24 1c 5a 44 01 d0 41 39 c4 76 02 0b 48 89 df 44 01 e5 e8 f1 ce 10 00 48 8b 34 24 44 89 f0 44 01
    [ 17.940611] RSP: 0018:ffffc90002277c50 EFLAGS: 00010216
    [ 17.940612] RAX: 0000000000000008 RBX: ffff88807ca48860 RCX: 0000000000000000
    [ 17.940612] RDX: ffff88807bc03800 RSI: 0000000000000020 RDI: 0000000000000000
    [ 17.940613] RBP: 0000000000000001 R08: ffffc90002277c78 R09: ffffc90002277c70
    [ 17.940613] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000020
    [ 17.940614] R13: 0000000000025d08 R14: 0000000000000001 R15: ffff88807bc03800
    [ 17.940614] FS: 0000000000000000(0000) GS:ffff88807db80000(0000) knlGS:0000000000000000
    [ 17.940616] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 17.940617] CR2: 00005635e583f790 CR3: 000000000240a000 CR4: 00000000000006e0
    [ 17.940617] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 17.940618] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 17.940618] Call Trace:
    [ 17.940622] __pci_enable_msix_range+0x215/0x540
    [ 17.940623] ? kernfs_put+0x117/0x160
    [ 17.940625] pci_alloc_irq_vectors_affinity+0x74/0x110
    [ 17.940626] nvme_reset_work+0xc30/0x1397 [nvme]
    [ 17.940628] ? __switch_to_asm+0x34/0x70
    [ 17.940628] ? __switch_to_asm+0x40/0x70
    [ 17.940629] ? __switch_to_asm+0x34/0x70
    [ 17.940630] ? __switch_to_asm+0x40/0x70
    [ 17.940630] ? __switch_to_asm+0x34/0x70
    [ 17.940631] ? __switch_to_asm+0x40/0x70
    [ 17.940632] ? nvme_irq_check+0x30/0x30 [nvme]
    [ 17.940633] process_one_work+0x20b/0x3e0
    [ 17.940634] worker_thread+0x1f9/0x3d0
    [ 17.940635] ? cancel_delayed_work+0xa0/0xa0
    [ 17.940636] kthread+0x117/0x120
    [ 17.940637] ? kthread_stop+0xf0/0xf0
    [ 17.940638] ret_from_fork+0x3a/0x50
    [ 17.940639] ---[ end trace aca8a131361cd42a ]---
    [ 17.942124] nvme nvme0: 7/1/0 default/read/poll queues

    Signed-off-by: Minwoo Im
    Signed-off-by: Christoph Hellwig

    Minwoo Im