05 Jan, 2020

1 commit

  • [ Upstream commit 4e24e37d5313edca8b4ab86f240c046c731e28d6 ]

    drivers/nvdimm/btt.c: In function 'btt_read_pg':
    drivers/nvdimm/btt.c:1264:8: warning: variable 'rc' set but not used
    [-Wunused-but-set-variable]
    int rc;
    ^~

    Add a ratelimited message in case a storm of errors is encountered.
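
    A minimal sketch of the shape of such a fix (the call site, message, and
    use of btt.c's local to_dev() helper are assumptions; the actual
    error-clearing path in btt_read_pg() differs):

        if (rc)
                /* consume 'rc' and avoid flooding the log on an error storm */
                dev_err_ratelimited(to_dev(arena),
                                "error clearing poison: %d\n", rc);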

    Fixes: d9b83c756953 ("libnvdimm, btt: rework error clearing")
    Signed-off-by: Qian Cai
    Reviewed-by: Vishal Verma
    Link: https://lore.kernel.org/r/1572530719-32161-1-git-send-email-cai@lca.pw
    Signed-off-by: Dan Williams
    Signed-off-by: Sasha Levin

    Qian Cai
     

30 Sep, 2019

1 commit

  • More libnvdimm updates from Dan Williams:

    - Complete the reworks to interoperate with powerpc dynamic huge page
    sizes

    - Fix a crash due to missed accounting for the powerpc 'struct
    page'-memmap mapping granularity

    - Fix badblock initialization for volatile (DRAM emulated) pmem ranges

    - Stop triggering request_key() notifications to userspace when
    NVDIMM-security is disabled / not present

    - Miscellaneous small fixups

    * tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm/region: Enable MAP_SYNC for volatile regions
    libnvdimm: prevent nvdimm from requesting key when security is disabled
    libnvdimm/region: Initialize bad block for volatile namespaces
    libnvdimm/nfit_test: Fix acpi_handle redefinition
    libnvdimm/altmap: Track namespace boundaries in altmap
    libnvdimm: Fix endian conversion issues 
    libnvdimm/dax: Pick the right alignment default when creating dax devices
    powerpc/book3s64: Export has_transparent_hugepage() related functions.

    Linus Torvalds
     

25 Sep, 2019

6 commits

  • Some environments want to use a host tmpfs/ramdisk to back guest pmem.
    While the data is not persisted relative to the host it *is* persisted
    relative to guest crashes / reboots. The guest is free to use dax and
    MAP_SYNC to keep filesystem metadata consistent with dax accesses
    without requiring guest fsync(). The guest can also observe that the
    region is volatile and skip cache flushing as global visibility is
    enough to "persist" data relative to the host staying alive over guest
    reset events.
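
    For illustration, a guest application opts in with a MAP_SYNC mapping
    roughly as below (a userspace sketch; the file path and mapping length
    are assumptions):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sys/mman.h>

        int fd = open("/mnt/pmem/data", O_RDWR);
        void *p = mmap(NULL, 1UL << 21, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        /* stores through 'p' keep file metadata consistent without fsync() */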

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Pankaj Gupta
    Link: https://lore.kernel.org/r/20190924114327.14700-1-aneesh.kumar@linux.ibm.com
    [djbw: reword the changelog]
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • The current implementation attempts to request keys from the keyring even
    when security is not enabled. Change the behavior so that when security is
    disabled, the key request is skipped.

    Error messages seen when no keys are installed and libnvdimm is loaded:

    request-key[4598]: Cannot find command to construct key 661489677
    request-key[4606]: Cannot find command to construct key 34713726

    Cc: stable@vger.kernel.org
    Fixes: 4c6926a23b76 ("acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs")
    Signed-off-by: Dave Jiang
    Link: https://lore.kernel.org/r/156934642272.30222.5230162488753445916.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dan Williams

    Dave Jiang
     
  • We check for bad blocks during namespace init, and that check uses the
    region bad block list. We need to initialize the bad block list for
    volatile regions for this to work. We also observe the lockdep warning
    below because the lock is not initialized correctly when bad block init
    is skipped for volatile regions.

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149
    Call Trace:
    [c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable)
    [c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60
    [c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0
    [c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270
    [c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290
    [c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0
    [c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0
    [c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160
    [c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240
    [c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0
    [c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0
    [c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0
    [c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0
    [c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130
    [c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50
    [c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0
    [c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170
    [c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100
    [c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48
    [c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0
    [c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c
    [c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180
    [c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68
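
    A sketch of the missing initialization (the exact call site in the region
    driver differs; badblocks_init() is what sets up the seqlock that lockdep
    complains about above):

        struct badblocks *bb = &nd_region->bb;

        badblocks_init(bb, 1);          /* sets up bb->lock (seqlock) and the backing page */
        /* ... later, nd_pfn_validate() can safely call ... */
        badblocks_check(bb, sector, num_sectors, &first_bad, &num_bad);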

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • With a PFN_MODE_PMEM namespace, the memmap area is allocated from the device
    area. Some architectures map the memmap area with a large page size. On
    architectures like ppc64, a 16MB page for the memmap mapping can map 262144
    pfns. This maps a namespace size of 16G.

    When populating the memmap region with 16MB pages from the device area,
    make sure the allocated space is not used to map resources outside this
    namespace. Such usage of the device area will prevent a namespace destroy.

    Add the resource end pfn in the altmap and use that to check whether the
    memmap area allocation can map pfns outside the namespace. On ppc64, in
    that case we fall back to allocation from memory.

    This fixes the kernel crash reported below:

    [ 132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
    [ 133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
    [ 133.464760] Faulting instruction address: 0xc00000000007580c
    [ 133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    .....
    [ 133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
    [ 133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
    [ 133.464910] Call Trace:
    [ 133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
    [ 133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
    [ 133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
    [ 133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
    [ 133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
    [ 133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
    [ 133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
    [ 133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
    [ 133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
    [ 133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
    [ 133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
    [ 133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
    [ 133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
    [ 133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250

    djbw: Aneesh notes that this crash can likely be triggered in any kernel that
    supports 'papr_scm', so flagging that commit for -stable consideration.

    Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions")
    Cc:
    Reported-by: Sachin Sant
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Pankaj Gupta
    Tested-by: Santosh Sivaraj
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20190910062826.10041-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • An nd_label->dpa endianness issue was observed when trying to enable a
    namespace created with a little-endian kernel on a big-endian kernel. That
    prompted running `sparse` on the rest of the code, and the other changes
    are the result of that.
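
    A hedged example of the class of issue sparse flags here (the field is
    from the on-media label; the exact comparisons that were fixed differ):

        __le64 dpa = nd_label->dpa;              /* stored little-endian on media */
        if (__le64_to_cpu(dpa) != res->start)    /* convert before comparing */
                continue;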

    Fixes: d9b83c756953 ("libnvdimm, btt: rework error clearing")
    Fixes: 9dedc73a4658 ("libnvdimm/btt: Fix LBA masking during 'free list' population")
    Reviewed-by: Vishal Verma
    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190809074726.27815-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • Allow the arch to provide the supported alignments and use hugepage
    alignment only if hugepages are supported. Right now we depend on
    compile-time configs, whereas this patch switches to runtime discovery.

    Architectures like ppc64 can have THP enabled in code, but then have the
    hugepage size disabled by the hypervisor. This allows us to create dax
    devices with PAGE_SIZE alignment in this case.

    An existing dax namespace with an alignment larger than PAGE_SIZE will fail
    to initialize in this specific case. We still allow fsdax namespace
    initialization.

    With respect to identifying whether to enable hugepage faults for a dax
    device: if THP is enabled at compile time, we default to taking hugepage
    faults, and in the dax fault handler, if we find the fault size is greater
    than the alignment, we retry with a PAGE_SIZE fault size.

    This also addresses the failure scenario below on ppc64:

    ndctl create-namespace --mode=devdax | grep align
    "align":16777216,
    "align":16777216

    cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
    65536 16777216

    daxio.static-debug -z -o /dev/dax0.0
    Bus error (core dumped)

    $ dmesg | tail
    lpar: Failed hash pte insert with error -4
    hash-mmu: mm: Hashing failure ! EA=0x7fff17000000 access=0x8000000000000006 current=daxio
    hash-mmu: trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 psize 10 pte=0xc000000501002b86
    daxio[3860]: bus error (7) at 7fff17000000 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b0000+20000]
    daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120
    daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 e93f0088 39290008 f93f0110

    The failure was due to the guest kernel using the wrong page size.

    The namespaces created with 16M alignment will appear as below on a config with
    16M page size disabled.

    $ ndctl list -Ni
    [
      {
        "dev":"namespace0.1",
        "mode":"fsdax",
        "map":"dev",
        "size":5351931904,
        "uuid":"fc6e9667-461a-4718-82b4-69b24570bddb",
        "align":16777216,
        "blockdev":"pmem0.1",
        "supported_alignments":[
          65536
        ]
      },
      {
        "dev":"namespace0.0",
        "mode":"fsdax",
    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-8-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

22 Sep, 2019

2 commits

  • Pull libnvdimm updates from Dan Williams:
    "Some reworks to better support nvdimms on powerpc and an nvdimm
    security interface update:

    - Rework the nvdimm core to accommodate architectures with different
    page sizes and ones that can change supported huge page sizes at
    boot time rather than a compile time constant.

    - Introduce a distinct 'frozen' attribute for the nvdimm security
    state since it is independent of the locked state.

    - Miscellaneous fixups"

    * tag 'libnvdimm-for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: Use PAGE_SIZE instead of SZ_4K for align check
    libnvdimm/label: Remove the dpa align check
    libnvdimm/pfn_dev: Add page size and struct page size to pfn superblock
    libnvdimm/pfn_dev: Add a build check to make sure we notice when struct page size change
    libnvdimm/pmem: Advance namespace seed for specific probe errors
    libnvdimm/region: Rewrite _probe_success() to _advance_seeds()
    libnvdimm/security: Consolidate 'security' operations
    libnvdimm/security: Tighten scope of nvdimm->busy vs security operations
    libnvdimm/security: Introduce a 'frozen' attribute
    libnvdimm, region: Use struct_size() in kzalloc()
    tools/testing/nvdimm: Fix fallthrough warning
    libnvdimm/of_pmem: Provide a unique name for bus provider

    Linus Torvalds
     
  • Pull hmm updates from Jason Gunthorpe:
    "This is more cleanup and consolidation of the hmm APIs and the very
    strongly related mmu_notifier interfaces. Many places across the tree
    using these interfaces are touched in the process. Beyond that a
    cleanup to the page walker API and a few memremap related changes
    round out the series:

    - General improvement of hmm_range_fault() and related APIs, more
    documentation, bug fixes from testing, API simplification &
    consolidation, and unused API removal

    - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
    and make them internal kconfig selects

    - Hoist a lot of code related to mmu notifier attachment out of
    drivers by using a refcount get/put attachment idiom and remove the
    convoluted mmu_notifier_unregister_no_release() and related APIs.

    - General API improvement for the migrate_vma API and revision of its
    only user in nouveau

    - Annotate mmu_notifiers with lockdep and sleeping region debugging

    Two series unrelated to HMM or mmu_notifiers came along due to
    dependencies:

    - Allow pagemap's memremap_pages family of APIs to work without
    providing a struct device

    - Make walk_page_range() and related use a constant structure for
    function pointers"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...

    Linus Torvalds
     

07 Sep, 2019

1 commit

  • The infrastructure to mock core libnvdimm routines for unit testing
    purposes is prone to bitrot relative to refactoring of that core. Arrange
    for the unit test core to be built when CONFIG_COMPILE_TEST=y. This does
    not result in a functional unit test environment; it is only a helper for
    0day to catch unit test build regressions.

    Note that there are a few x86isms in the implementation, so this does not
    bother compile-testing architectures other than 64-bit x86.

    Link: https://lore.kernel.org/r/156763690875.2556198.15786177395425033830.stgit@dwillia2-desk3.amr.corp.intel.com
    Reported-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Dan Williams
     

06 Sep, 2019

6 commits

  • Some architectures have a page size other than 4K. Use PAGE_SIZE to make
    sure ranges are correctly aligned.
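
    A sketch of the intent (the real checks live in the namespace/pfn device
    code):

        /* validate against the runtime page size, not a hard-coded SZ_4K */
        if (!IS_ALIGNED(start, PAGE_SIZE) || !IS_ALIGNED(size, PAGE_SIZE))
                return -ENXIO;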

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-7-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • There's no strict requirement for slot_valid() to check for page alignment,
    and the check would seem to actively hurt cross-page-size compatibility.
    Let's delete the check and rely on checksum validation.

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-6-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • This is needed so that the pmem probe doesn't wrongly initialize a
    namespace which doesn't have enough space reserved for holding struct
    pages with the current kernel.

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-5-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • Namespaces created in PFN_MODE_PMEM mode store struct page in the reserve
    block area. We need to make sure we account for the right struct page
    size while doing this. Instead of directly depending on sizeof(struct page),
    which can change based on kernel config options, use the maximum struct
    page size (64) while calculating the reserve block area. This makes sure a
    pmem device can be used across kernels built with different configs.

    If the above assumption about the maximum struct page size changes, we need
    to update the reserve block allocation space for newly created namespaces.
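
    A sketch of the sizing rule (the surrounding nd_pfn_init() arithmetic has
    additional alignment terms; 'npfns' and 'align' are assumed here):

        #define MAX_STRUCT_PAGE_SIZE 64         /* worst case across configs */

        /* reserve the memmap using the worst-case per-page metadata size */
        offset = ALIGN(start + SZ_8K + MAX_STRUCT_PAGE_SIZE * npfns, align)
                 - start;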

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-4-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • In order to support marking namespaces with unsupported features/versions
    disabled, the nvdimm core should advance the namespace seed on these
    probe failures. Otherwise, a failed namespace will be considered a
    seed namespace and will be wrongly used while creating new namespaces.

    Add -EOPNOTSUPP as a return value from the pmem probe callback to indicate
    a namespace initialization failure due to a pfn superblock feature/version
    mismatch.
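
    A sketch of the probe-side signal (the actual validation checks more than
    the major version; 'supported' is a placeholder):

        if (__le16_to_cpu(pfn_sb->version_major) > supported)
                return -EOPNOTSUPP;     /* tell the region to advance its seed */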

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     
  • The nd_region_probe_success() helper conflates seed management with
    nvdimm->busy tracking. Given that the 'busy' increment is handled internal
    to the nd_region driver 'probe' path, move the decrement to the 'remove'
    path. With that cleanup the routine can be renamed to the more descriptive
    nd_region_advance_seeds().

    The change is prompted by an incoming need to optionally advance the
    seeds on other events besides 'probe' success.

    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Dan Williams
    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Aug, 2019

4 commits

  • The security operations are exported from libnvdimm/security.c to
    libnvdimm/dimm_devs.c, and libnvdimm/security.c is optionally compiled
    based on the CONFIG_NVDIMM_KEYS config symbol.

    Rather than export the operations across compile objects, just move the
    __security_store() entry point to live with the helpers.

    Acked-by: Jeff Moyer
    Reviewed-by: Dave Jiang
    Link: https://lore.kernel.org/r/156686730515.184120.10522747907309996674.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • An attempt to freeze DIMMs currently runs afoul of default blocking of
    all security operations in the entry to the 'store' routine for the
    'security' sysfs attribute.

    The blanket blocking of all security operations while the DIMM is in
    active use in a region is too restrictive. The only security operations
    that need to be aware of the ->busy state are those that mutate the
    state of data, i.e. erase and overwrite.

    Refactor the ->busy checks to be applied at the common entry point,
    __security_store(), rather than in each of the helper routines, to enable
    freeze to be run regardless of busy state.

    Reviewed-by: Dave Jiang
    Reviewed-by: Jeff Moyer
    Link: https://lore.kernel.org/r/156686729996.184120.3458026302402493937.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In the process of debugging a system with an NVDIMM that was failing to
    unlock it was found that the kernel is reporting 'locked' while the DIMM
    security interface is 'frozen'. Unfortunately the security state is
    tracked internally as an enum which prevents it from communicating the
    difference between 'locked' and 'locked + frozen'. It follows that the
    enum also prevents the kernel from communicating 'unlocked + frozen'
    which would be useful for debugging why security operations like 'change
    passphrase' are disabled.

    Ditch the security state enum for a set of flags and introduce a new
    sysfs attribute explicitly for the 'frozen' state. The regression risk
    is low because the 'frozen' state was already blocked behind the
    'locked' state, but we will need to revisit this if there are cases where
    applications need 'frozen' to show up in the primary 'security'
    attribute. The expectation is that communicating 'frozen' is mostly a
    helper for debug and status monitoring.
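
    A sketch of the representation change (flag names follow this series):

        /* before: one mutually-exclusive state */
        enum nvdimm_security_state state;

        /* after: independent bits, so 'frozen' can coexist with the others */
        unsigned long flags;
        set_bit(NVDIMM_SECURITY_UNLOCKED, &flags);
        set_bit(NVDIMM_SECURITY_FROZEN, &flags);  /* surfaced via the new sysfs file */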

    Reviewed-by: Dave Jiang
    Reported-by: Jeff Moyer
    Reviewed-by: Jeff Moyer
    Link: https://lore.kernel.org/r/156686729474.184120.5835135644278860826.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct nd_region {
            ...
            struct nd_mapping mapping[0];
    };

    instance = kzalloc(sizeof(struct nd_region) +
                       sizeof(struct nd_mapping) * count, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kzalloc(struct_size(instance, mapping, count), GFP_KERNEL);

    This code was detected with the help of Coccinelle.

    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Vishal Verma
    Reviewed-by: Kees Cook
    Link: https://lore.kernel.org/r/20190610210613.GA21989@embeddedor
    Signed-off-by: Dan Williams

    Gustavo A. R. Silva
     

29 Aug, 2019

1 commit

  • Yi reported[1] that after commit a3619190d62e ("libnvdimm/pfn: stop
    padding pmem namespaces to section alignment"), it was no longer
    possible to create a device dax namespace with a 1G alignment. The
    reason was that the pmem region was not itself 1G-aligned. The code
    happily skips past the first 512M, but fails to account for a now
    misaligned end offset (since space was allocated starting at that
    misaligned address, and extending for size GBs). Reintroduce
    end_trunc, so that the code correctly handles the misaligned end
    address. This results in the same behavior as before the introduction
    of the offending commit.

    [1] https://lists.01.org/pipermail/linux-nvdimm/2019-July/022813.html
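
    A sketch of the reintroduced trim (names follow the pfn info-block fields;
    the exact expressions in nd_pfn_init() differ):

        start_pad = ALIGN(start, align) - start;        /* skip misaligned head */
        end_trunc = end - ALIGN_DOWN(end, align);       /* drop misaligned tail */
        /* usable capacity now excludes both partial chunks at either end */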

    Fixes: a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces ...")
    Reported-and-tested-by: Yi Zhang
    Signed-off-by: Jeff Moyer
    Link: https://lore.kernel.org/r/x49ftll8f39.fsf@segfault.boston.devel.redhat.com
    Signed-off-by: Dan Williams

    Jeff Moyer
     

14 Aug, 2019

1 commit

  • ndctl binaries, v66 and older, mistakenly require each ndbus to have a
    unique name. If not, then while enumerating the buses in userspace ndctl
    drops buses with similar names, and as a result devices beneath those
    buses are not listed.

    Signed-off-by: Aneesh Kumar K.V
    Tested-by: Vaibhav Jain
    Link: https://lore.kernel.org/r/20190807040029.11344-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

27 Jul, 2019

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "A collection of locking and async operations fixes for v5.3-rc2. These
    had been soaking in a branch targeting the merge window, but missed
    due to a regression hunt. This fixed up version has otherwise been in
    -next this past week with no reported issues.

    In order to gain confidence in the locking changes the pull also
    includes a debug / instrumentation patch to enable lockdep coverage
    for libnvdimm subsystem operations that depend on the device_lock for
    exclusion. As mentioned in the changelog it is a hack, but it works
    and documents the locking expectations of the sub-system in a way that
    others can use lockdep to verify. The driver core touches got an ack
    from Greg.

    Summary:

    - Fix duplicate device_unregister() calls (multiple threads competing
    to do unregister work when scheduling device removal from a sysfs
    attribute of the self-same device).

    - Fix badblocks registration order bug. Ensure region badblocks are
    initialized in advance of namespace registration.

    - Fix a deadlock between the bus lock and probe operations.

    - Export device-core infrastructure to coordinate async operations
    via the device ->dead state.

    - Add device-core infrastructure to validate device_lock() usage with
    lockdep"

    * tag 'libnvdimm-fixes-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    driver-core, libnvdimm: Let device subsystems add local lockdep coverage
    libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
    libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
    libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
    libnvdimm/region: Register badblocks before namespaces
    libnvdimm/bus: Prevent duplicate device_unregister() calls
    drivers/base: Introduce kill_device()

    Linus Torvalds
     

20 Jul, 2019

1 commit

  • Merge yet more updates from Andrew Morton:
    "The rest of MM and a kernel-wide procfs cleanup.

    Summary of the more significant patches:

    - Patch series "mm/memory_hotplug: Factor out memory block
    devicehandling", v3. David Hildenbrand.

    Some spring-cleaning of the memory hotplug code, notably in
    drivers/base/memory.c

    - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
    Shi.

    Fix /proc/pid/smaps output for THP pages used in shmem.

    - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.

    Bugfix and speedup for kernel/resource.c

    - Patch series "mm: Further memory block device cleanups", David
    Hildenbrand.

    More spring-cleaning of the memory hotplug code.

    - Patch series "mm: Sub-section memory hotplug support". Dan
    Williams.

    Generalise the memory hotplug code so that pmem can use it more
    completely. Then remove the hacks from the libnvdimm code which
    were there to work around the memory-hotplug code's constraints.

    - "proc/sysctl: add shared variables for range check", Matteo Croce.

    We have about 250 instances of

    int zero;
    ...
    .extra1 = &zero,

    in the tree. This is a tree-wide sweep to make all those private
    "zero"s and "one"s use global variables.

    Alas, it isn't practical to make those two global integers const"

    * emailed patches from Andrew Morton : (38 commits)
    proc/sysctl: add shared variables for range check
    mm: migrate: remove unused mode argument
    mm/sparsemem: cleanup 'section number' data types
    libnvdimm/pfn: stop padding pmem namespaces to section alignment
    libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
    mm/devm_memremap_pages: enable sub-section remap
    mm: document ZONE_DEVICE memory-model implications
    mm/sparsemem: support sub-section hotplug
    mm/sparsemem: prepare for sub-section ranges
    mm: kill is_dev_zone() helper
    mm/hotplug: kill is_dev_zone() usage in __remove_pages()
    mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
    mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
    mm/sparsemem: add helpers track active portions of a section at boot
    mm/sparsemem: introduce a SECTION_IS_EARLY flag
    mm/sparsemem: introduce struct mem_section_usage
    drivers/base/memory.c: get rid of find_memory_block_hinted()
    mm/memory_hotplug: move and simplify walk_memory_blocks()
    mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
    mm: make register_mem_sect_under_node() static
    ...

    Linus Torvalds
     

19 Jul, 2019

9 commits

  • Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
    memory, we no longer need to add padding at pfn/dax device creation
    time. The kernel will still honor padding established by older kernels.

    Link: http://lkml.kernel.org/r/156092356588.979959.6793371748950931916.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Jeff Moyer
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Logan Gunthorpe
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • At namespace creation time there is the potential for the "expected to
    be zero" fields of a 'pfn' info-block to be filled with indeterminate
    data. While the kernel buffer is zeroed on allocation, it is immediately
    overwritten by nd_pfn_validate(), filling it with the current contents of
    the on-media info-block location. For fields like 'flags' and the
    'padding', it potentially means that future implementations cannot rely
    on those fields being zero.

    In preparation to stop using the 'start_pad' and 'end_trunc' fields for
    section alignment, arrange for fields that are not explicitly
    initialized to be guaranteed zero. Bump the minor version to indicate
    it is safe to assume the 'padding' and 'flags' are zero. Otherwise,
    this corruption is expected to be benign since all other critical fields
    are explicitly initialized.
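
    A sketch of the guarantee being added (the buffer is reused by
    nd_pfn_validate(), so clear it before writing a fresh info-block;
    'new_minor' is a placeholder for the bumped version):

        memset(pfn_sb, 0, sizeof(*pfn_sb));
        /* ... explicitly initialize the supported fields ... */
        pfn_sb->version_minor = cpu_to_le16(new_minor);  /* padding now known-zero */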

    Note: the cc: stable is about spreading this new policy to as many
    kernels as possible, not fixing an issue in those kernels. It is not
    until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
    section alignment" where this improper initialization becomes a problem.
    So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
    namespaces to section alignment" (which is not tagged for stable), make
    sure this pre-requisite is flagged.

    Link: http://lkml.kernel.org/r/156092356065.979959.6681003754765958296.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
    Signed-off-by: Dan Williams
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc:
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Logan Gunthorpe
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • For good reason, the standard device_lock() is marked
    lockdep_set_novalidate_class() because there is simply no sane way to
    describe the myriad ways the device_lock() is ordered with other locks.
    However, that leaves subsystems that know their own local device_lock()
    ordering rules to find lock ordering mistakes manually. Instead,
    introduce an optional / additional lockdep-enabled lock that a subsystem
    can acquire in all the same paths that the device_lock() is acquired.

    A conversion of the NFIT driver and NVDIMM subsystem to a
    lockdep-validate device_lock() scheme is included. The
    debug_nvdimm_lock() implementation implements the correct lock-class and
    stacking order for the libnvdimm device topology hierarchy.

    Yes, this is a hack, but hopefully it is a useful hack for other
    subsystems device_lock() debug sessions. Quoting Greg:

    "Yeah, it feels a bit hacky but it's really up to a subsystem to mess up
    using it as much as anything else, so user beware :)

    I don't object to it if it makes things easier for you to debug."
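
    A sketch of the pattern (helper names follow this series; the real
    implementation also picks the right lock class for the device's place in
    the libnvdimm topology):

        static void nd_device_lock(struct device *dev)
        {
                device_lock(dev);
                debug_nvdimm_lock(dev);         /* lockdep-visible shadow lock */
        }

        static void nd_device_unlock(struct device *dev)
        {
                debug_nvdimm_unlock(dev);
                device_unlock(dev);
        }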

    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Will Deacon
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Peter Zijlstra
    Cc: Vishal Verma
    Cc: "Rafael J. Wysocki"
    Cc: Greg Kroah-Hartman
    Signed-off-by: Dan Williams
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Ira Weiny
    Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     
  • A multithreaded namespace creation/destruction stress test currently
    deadlocks with the following lockup signature:

    INFO: task ndctl:2924 blocked for more than 122 seconds.
    Tainted: G OE 5.2.0-rc4+ #3382
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ndctl D 0 2924 1176 0x00000000
    Call Trace:
    ? __schedule+0x27e/0x780
    schedule+0x30/0xb0
    wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm]
    ? finish_wait+0x80/0x80
    uuid_store+0xe6/0x2e0 [libnvdimm]
    kernfs_fop_write+0xf0/0x1a0
    vfs_write+0xb7/0x1b0
    ksys_write+0x5c/0xd0
    do_syscall_64+0x60/0x240

    INFO: task ndctl:2923 blocked for more than 122 seconds.
    Tainted: G OE 5.2.0-rc4+ #3382
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ndctl D 0 2923 1175 0x00000000
    Call Trace:
    ? __schedule+0x27e/0x780
    ? __mutex_lock+0x489/0x910
    schedule+0x30/0xb0
    schedule_preempt_disabled+0x11/0x20
    __mutex_lock+0x48e/0x910
    ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
    ? __lock_acquire+0x23f/0x1710
    ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
    nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
    __dax_pmem_probe+0x5e/0x210 [dax_pmem_core]
    ? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm]
    dax_pmem_probe+0xc/0x20 [dax_pmem]
    nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
    really_probe+0xef/0x390
    driver_probe_device+0xb4/0x100

    In this sequence an 'nd_dax' device is being probed and trying to take
    the lock on its backing namespace to validate that the 'nd_dax' device
    indeed has exclusive access to the backing namespace. Meanwhile, another
    thread is trying to update the uuid property of that same backing
    namespace. So one thread is in the probe path trying to acquire the
    lock, and the other thread has acquired the lock and tries to flush the
    probe path.

    Fix this deadlock by not holding the namespace device_lock over the
    wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires
    the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and
    subsequently dropped internally to wait_nvdimm_bus_probe_idle().

    Cc:
    Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
    Cc: Vishal Verma
    Tested-by: Jane Chu
    Link: https://lore.kernel.org/r/156341210094.292348.2384694131126767789.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In preparation for fixing a deadlock between wait_for_bus_probe_idle()
    and the nvdimm_bus_list_mutex, arrange for __nd_ioctl() to run without
    nvdimm_bus_list_mutex held. This also unifies the 'dimm' and 'bus' level
    ioctls into a common nd_ioctl() preamble implementation.

    Marked for -stable as it is a pre-requisite for a follow-on fix.

    Cc:
    Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
    Cc: Vishal Verma
    Tested-by: Jane Chu
    Link: https://lore.kernel.org/r/156341209518.292348.7183897251740665198.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In preparation for not holding a lock over the execution of nd_ioctl(),
    update the implementation to allow multiple threads to be attempting
    ioctls at the same time. The bus lock still prevents multiple in-flight
    ->ndctl() invocations from corrupting each other's state, but static
    global staging buffers are moved to the heap.
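
    A sketch of the change in shape (buffer name and size constant mirror the
    ioctl envelope; error handling abbreviated):

        /* before: a shared static staging buffer, unsafe for concurrent callers */
        static char in_env[ND_CMD_MAX_ENVELOPE];

        /* after: a per-call allocation */
        char *in_env = kzalloc(ND_CMD_MAX_ENVELOPE, GFP_KERNEL);
        if (!in_env)
                return -ENOMEM;
        /* ... use the buffer ... */
        kfree(in_env);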

    Reported-by: Vishal Verma
    Reviewed-by: Vishal Verma
    Tested-by: Vishal Verma
    Link: https://lore.kernel.org/r/156341208947.292348.10560140326807607481.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Namespace activation expects to be able to reference region badblocks.
    The following warning sometimes triggers when asynchronous namespace
    activation races in front of the completion of namespace probing. Move
    all possible namespace probing after region badblocks initialization.

    Otherwise, lockdep sometimes catches the uninitialized state of the
    badblocks seqlock with stack trace signatures like:

    INFO: trying to register non-static key.
    pmem2: detected capacity change from 0 to 136365211648
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted: G OE 5.2.0-rc4+ #3382
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: events_unbound async_run_entry_fn
    Call Trace:
    dump_stack+0x85/0xc0
    pmem1.12: detected capacity change from 0 to 8589934592
    register_lock_class+0x56a/0x570
    ? check_object+0x140/0x270
    __lock_acquire+0x80/0x1710
    ? __mutex_lock+0x39d/0x910
    lock_acquire+0x9e/0x180
    ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
    badblocks_check+0x93/0x1f0
    ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
    nd_pfn_validate+0x28f/0x440 [libnvdimm]
    ? lockdep_hardirqs_on+0xf0/0x180
    nd_dax_probe+0x9a/0x120 [libnvdimm]
    nd_pmem_probe+0x6d/0x180 [nd_pmem]
    nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]

    Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...")
    Cc:
    Cc: Vishal Verma
    Reviewed-by: Vishal Verma
    Link: https://lore.kernel.org/r/156341208365.292348.1547528796026249120.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • A multithreaded namespace creation/destruction stress test currently
    fails with signatures like the following:

    sysfs group 'power' not found for kobject 'dax1.1'
    RIP: 0010:sysfs_remove_group+0x76/0x80
    Call Trace:
    device_del+0x73/0x370
    device_unregister+0x16/0x50
    nd_async_device_unregister+0x1e/0x30 [libnvdimm]
    async_run_entry_fn+0x39/0x160
    process_one_work+0x23c/0x5e0
    worker_thread+0x3c/0x390

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    RIP: 0010:klist_put+0x1b/0x6c
    Call Trace:
    klist_del+0xe/0x10
    device_del+0x8a/0x2c9
    ? __switch_to_asm+0x34/0x70
    ? __switch_to_asm+0x40/0x70
    device_unregister+0x44/0x4f
    nd_async_device_unregister+0x22/0x2d [libnvdimm]
    async_run_entry_fn+0x47/0x15a
    process_one_work+0x1a2/0x2eb
    worker_thread+0x1b8/0x26e

    Use the kill_device() helper to atomically resolve the race of multiple
    threads issuing kill (device_unregister()) requests.

    Reported-by: Jane Chu
    Reported-by: Erwin Tsaur
    Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
    Cc:
    Link: https://github.com/pmem/ndctl/issues/96
    Tested-by: Jane Chu
    Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Pull libnvdimm updates from Dan Williams:
    "Primarily just the virtio_pmem driver:

    - virtio_pmem

    The new virtio_pmem facility introduces a paravirtualized
    persistent memory device that allows a guest VM to use DAX
    mechanisms to access a host-file with host-page-cache. It arranges
    for MAP_SYNC to be disabled and instead triggers a host fsync()
    when a 'write-cache flush' command is sent to the virtual disk
    device.

    - Miscellaneous small fixups"

    * tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    virtio_pmem: fix sparse warning
    xfs: disable map_sync for async flush
    ext4: disable map_sync for async flush
    dax: check synchronous mapping is supported
    dm: enable synchronous dax
    libnvdimm: add dax_dev sync flag
    virtio-pmem: Add virtio pmem driver
    libnvdimm: nd_region flush callback support
    libnvdimm, namespace: Drop uuid_t implementation detail

    Linus Torvalds
     

17 Jul, 2019

1 commit

  • This patch fixes the sparse warnings below, related to the __virtio
    type, in the virtio pmem driver. The warnings were reported by the Intel
    test robot on the linux-next tree.

    nd_virtio.c:56:28: warning: incorrect type in assignment
    (different base types)
    nd_virtio.c:56:28: expected unsigned int [unsigned] [usertype] type
    nd_virtio.c:56:28: got restricted __virtio32
    nd_virtio.c:93:59: warning: incorrect type in argument 2
    (different base types)
    nd_virtio.c:93:59: expected restricted __virtio32 [usertype] val
    nd_virtio.c:93:59: got unsigned int [unsigned] [usertype] ret
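
    A hedged example of the conversion pattern such a fix applies (field and
    constant names are assumed from the virtio_pmem request/response structs):

        req->type = cpu_to_virtio32(vdev, VIRTIO_PMEM_REQ_TYPE_FLUSH);
        /* ... */
        err = virtio32_to_cpu(vdev, resp->ret);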

    Reported-by: kbuild test robot
    Signed-off-by: Pankaj Gupta
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Dan Williams

    Pankaj Gupta
     

06 Jul, 2019

3 commits

  • This patch adds a 'DAXDEV_SYNC' flag which is set for an nd_region doing
    synchronous flush. This is later used to disable the MAP_SYNC
    functionality in the ext4 and xfs filesystems for devices that don't
    support synchronous flush.
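
    A sketch of how a filesystem consumes the flag in its mmap path (the
    dax_synchronous() helper is part of this series):

        if ((vma->vm_flags & VM_SYNC) && !dax_synchronous(dax_dev))
                return -EOPNOTSUPP;     /* refuse MAP_SYNC on a non-sync device */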

    Signed-off-by: Pankaj Gupta
    Signed-off-by: Dan Williams

    Pankaj Gupta
     
  • This patch adds a virtio-pmem driver for KVM guests.

    The guest reads the persistent memory range information from
    Qemu over VIRTIO and registers it on the nvdimm_bus. It also
    creates an nd_region object with the persistent memory range
    information so that the existing 'nvdimm/pmem' driver can
    reserve this in the system memory map. This way the
    'virtio-pmem' driver uses the existing functionality of the
    pmem driver to register persistent memory compatible with
    DAX-capable filesystems.

    This also provides a function to perform a guest flush over
    VIRTIO from the 'pmem' driver when userspace performs a flush
    on a DAX memory range.

    Signed-off-by: Pankaj Gupta
    Reviewed-by: Yuval Shaia
    Acked-by: Michael S. Tsirkin
    Acked-by: Jakub Staron
    Tested-by: Jakub Staron
    Reviewed-by: Cornelia Huck
    Signed-off-by: Dan Williams

    Pankaj Gupta
     
  • This patch adds functionality to perform a flush from the guest
    to the host over VIRTIO. We register a callback based on the
    'nd_region' type: the virtio_pmem driver requires this special
    flush function, while for the rest of the region types we
    register the existing flush function. An error returned by a
    host fsync failure is reported to userspace.
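
    A sketch of how a region driver opts into the callback (virtio_pmem-style
    usage; structure setup abbreviated):

        ndr_desc.res = &res;
        ndr_desc.flush = async_pmem_flush;      /* per-region flush override */
        ndr_desc.provider_data = vpmem;
        nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, &ndr_desc);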

    Signed-off-by: Pankaj Gupta
    Signed-off-by: Dan Williams

    Pankaj Gupta