28 Jun, 2017

1 commit


13 Jun, 2017

1 commit


09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

05 Jun, 2017

1 commit

  • Hoist the libnvdimm helper as an inline helper to linux/uuid.h
    using an auxiliary const variable uuid_null in lib/uuid.c.

    [hch: also add the guid variant. Both do the same but I'd like
    to keep casts to a minimum]

    The common helper uses the new abstract type uuid_t * instead of
    u8 *.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Amir Goldstein
    [hch: added guid_is_null]
    Signed-off-by: Christoph Hellwig
    Acked-by: Dan Williams
    Reviewed-by: Andy Shevchenko

    Christoph Hellwig
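
The helper described in the commit above can be sketched in plain userspace C. This is only an illustration under stated assumptions: sketch_uuid_t, sketch_uuid_null and sketch_uuid_is_null() are hypothetical stand-ins for the kernel's uuid_t, uuid_null and the new inline helper, and the kernel's actual definitions may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an abstract 16-byte UUID type. */
typedef struct {
    uint8_t b[16];
} sketch_uuid_t;

/* Auxiliary all-zero constant, playing the role of uuid_null. */
static const sketch_uuid_t sketch_uuid_null = { { 0 } };

/* Returns true when every byte of the UUID is zero. */
static inline bool sketch_uuid_is_null(const sketch_uuid_t *uuid)
{
    assert(uuid != NULL);
    return memcmp(uuid, &sketch_uuid_null, sizeof(*uuid)) == 0;
}
```

Taking a typed pointer rather than a raw u8 * is what lets callers avoid casts, which is the motivation given in the commit message.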
     

13 May, 2017

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

11 May, 2017

2 commits

  • If we had badblocks/poison in the metadata area of a BTT, recreating the
    BTT would not clear the poison in all cases, notably the flog area. This
    is because rw_bytes will only clear errors if the request being sent
    down is 512B aligned and sized.

    Make sure that when writing the map and info blocks, the rw_bytes being
    sent are of the correct size/alignment. For the flog, instead of doing
    the smaller log_entry writes only, first do a 'wipe' of the entire area
    by writing zeroes in large enough chunks so that errors get cleared.

    Cc: Andy Rudoff
    Cc: Dan Williams
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • nsio_rw_bytes can clear media errors, but this cannot be done while we
    are in an atomic context due to locking within ACPI. From the BTT,
    ->rw_bytes may be called either from atomic or process context depending
    on whether the calls happen during initialization or during IO.

    During init, we want to ensure error clearing happens, and the flag
    marking process context allows nsio_rw_bytes to do that. When called
    during IO, we're in atomic context, and error clearing can be skipped.

    Cc: Dan Williams
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
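
Taken together, the two commits above describe a gate on error clearing inside rw_bytes: the request must be 512B aligned and sized, and the caller must not be in atomic context. A minimal sketch of that decision follows. SKETCH_IO_ATOMIC and sketch_may_clear_errors() are hypothetical names chosen for illustration, and the flag polarity is an assumption; the commit only says that a flag distinguishes process context from atomic context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512u
/* Hypothetical flag: set by callers that must not sleep (the I/O path). */
#define SKETCH_IO_ATOMIC (1u << 0)

static bool sketch_is_aligned(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0);
    return (value % alignment) == 0;
}

/*
 * Decide whether a write of `size` bytes at `offset` may also clear
 * media errors. Clearing can sleep, so it is skipped in atomic context,
 * and it only applies to whole 512-byte sectors.
 */
static bool sketch_may_clear_errors(uint64_t offset, size_t size,
                                    unsigned int flags)
{
    if (flags & SKETCH_IO_ATOMIC)
        return false;
    if (size == 0)
        return false;
    return sketch_is_aligned(offset, SKETCH_SECTOR_SIZE) &&
           sketch_is_aligned(size, SKETCH_SECTOR_SIZE);
}
```

Under this model a small log_entry write never clears poison on its own, which is why the flog is first wiped with larger sector-sized zero writes during initialization.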
     

09 May, 2017

1 commit

  • There are many code paths opencoding kvmalloc. Let's use the helper
    instead. The main difference to kvmalloc is that those users are
    usually not considering all the aspects of the memory allocator. E.g.
    allocation requests
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdimm
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

05 May, 2017

4 commits

  • Dan Williams
     
  • Fix failures to create namespaces due to the vmem_altmap not advertising
    enough free space to store the memmap.

    WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0
    [..]
    Call Trace:
    dump_stack+0x63/0x83
    __warn+0xcb/0xf0
    warn_slowpath_null+0x1d/0x20
    arch_add_memory+0xde/0xf0
    devm_memremap_pages+0x244/0x440
    pmem_attach_disk+0x37e/0x490 [nd_pmem]
    nd_pmem_probe+0x7e/0xa0 [nd_pmem]
    nvdimm_bus_probe+0x71/0x120 [libnvdimm]
    driver_probe_device+0x2bb/0x460
    bind_store+0x114/0x160
    drv_attr_store+0x25/0x30

    In commit 658922e57b84 "libnvdimm, pfn: fix memmap reservation sizing"
    we arranged for the capacity to be allocated, but failed to also update
    the 'npfns' parameter. This leads to cases where there is enough
    capacity reserved to hold all the allocated sections, but
    vmemmap_populate_hugepages() still encounters -ENOMEM from
    altmap_alloc_block_buf().

    This fix is a stop-gap until we can teach the core memory hotplug
    implementation to permit sub-section hotplug.

    Cc:
    Fixes: 658922e57b84 ("libnvdimm, pfn: fix memmap reservation sizing")
    Reported-by: Anisha Allada
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Per the latest version of the "NVDIMM DSM Interface Example" [1], the
    label data retrieval routine can report a "locked" status. In this case
    all regions associated with that DIMM are disabled until the label area
    is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label
    data area locking capabilities.

    [1]: http://pmem.io/documents/

    Signed-off-by: Dan Williams

    Dan Williams
     
  • This is a preparation patch for handling locked nvdimm label regions, a
    new concept as introduced by the latest DSM document on pmem.io [1]. A
    future patch will leverage nvdimm_set_locked() at DIMM probe time to
    flag regions that can not be enabled. There should be no functional
    difference resulting from this change.

    [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf

    Signed-off-by: Dan Williams

    Dan Williams
     

02 May, 2017

2 commits

  • Pull x86 mm updates from Ingo Molnar:
    "The main x86 MM changes in this cycle were:

    - continued native kernel PCID support preparation patches to the TLB
    flushing code (Andy Lutomirski)

    - various fixes related to 32-bit compat syscall returning address
    over 4Gb in applications, launched from 64-bit binaries - motivated
    by C/R frameworks such as Virtuozzo. (Dmitry Safonov)

    - continued Intel 5-level paging enablement: in particular the
    conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)

    - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)

    - ... plus misc updates, fixes and cleanups"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
    mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
    x86/mm: Fix flush_tlb_page() on Xen
    x86/mm: Make flush_tlb_mm_range() more predictable
    x86/mm: Remove flush_tlb() and flush_tlb_current_task()
    x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
    x86/mm/64: Fix crash in remove_pagetable()
    Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
    x86/boot/e820: Remove a redundant self assignment
    x86/mm: Fix dump pagetables for 4 levels of page tables
    x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
    x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
    Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
    x86/espfix: Add support for 5-level paging
    x86/kasan: Extend KASAN to support 5-level paging
    x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
    x86/paravirt: Add 5-level support to the paravirt code
    x86/mm: Define virtual memory map for 5-level paging
    x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
    x86/boot: Detect 5-level paging support
    x86/mm/numa: Remove numa_nodemask_from_meminfo()
    ...

    Linus Torvalds
     
  • This continues the 4.11 status quo of disabling of error clearing from
    the BTT I/O path. Toshi found that even though we have eliminated all
    the libnvdimm sources of sleeping-while-atomic triggers, we still have
    sleeping operations that will occur in the path to send the ACPI DSM to
    the DIMM to clear the error:

    BUG: sleeping function called from invalid context at mm/slab.h:432
    in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd
    Call Trace:
    dump_stack+0x86/0xc3
    ___might_sleep+0x17d/0x250
    __might_sleep+0x4a/0x80
    __kmalloc+0x1c0/0x2e0
    acpi_os_allocate_zeroed+0x2d/0x2f
    acpi_evaluate_object+0x59/0x3b1
    acpi_evaluate_dsm+0xbd/0x10c
    acpi_nfit_ctl+0x1ef/0x7c0 [nfit]
    ? nsio_rw_bytes+0x152/0x280
    nvdimm_clear_poison+0x77/0x140
    nsio_rw_bytes+0x18f/0x280
    btt_write_pg+0x1d4/0x3d0 [nd_btt]
    btt_make_request+0x119/0x2d0 [nd_btt]

    A solution for tracking and handling media errors natively in the BTT is
    needed.

    Cc: Jeff Moyer
    Cc: Dave Jiang
    Cc: Vishal Verma
    Reported-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     

01 May, 2017

2 commits

  • A debug patch to turn the standard device_lock() into something that
    lockdep can analyze yielded the following:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.11.0-rc4+ #106 Tainted: G O
    -------------------------------------------------------
    lt-libndctl/1898 is trying to acquire lock:
    (&dev->nvdimm_mutex/3){+.+.+.}, at: [] nd_attach_ndns+0x178/0x1b0 [libnvdimm]

    but task is already holding lock:
    (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [] nvdimm_bus_lock+0x21/0x30 [libnvdimm]

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
    lock_acquire+0xf6/0x1f0
    __mutex_lock+0x88/0x980
    mutex_lock_nested+0x1b/0x20
    nvdimm_bus_lock+0x21/0x30 [libnvdimm]
    nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm]
    nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm]
    nd_pmem_probe+0x14/0x180 [nd_pmem]
    nvdimm_bus_probe+0xa9/0x260 [libnvdimm]

    -> #0 (&dev->nvdimm_mutex/3){+.+.+.}:
    __lock_acquire+0x1107/0x1280
    lock_acquire+0xf6/0x1f0
    __mutex_lock+0x88/0x980
    mutex_lock_nested+0x1b/0x20
    nd_attach_ndns+0x178/0x1b0 [libnvdimm]
    nd_namespace_store+0x308/0x3c0 [libnvdimm]
    namespace_store+0x87/0x220 [libnvdimm]

    In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'.

    Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect
    nd_{attach,detach}_ndns() operations.

    Cc:
    Fixes: 8c2f7e8658df ("libnvdimm: infrastructure for btt devices")
    Reported-by: Yi Zhang
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The x86 conversion to the generic GUP code included a small change which causes
    crashes and data corruption in the pmem code - not good.

    The root cause is that the /dev/pmem driver code implicitly relies on the x86
    get_user_pages() implementation doing a get_page() on the page refcount, because
    get_page() does a get_zone_device_page() which properly refcounts pmem's separate
    page struct arrays that are not present in the regular page struct structures.
    (The pmem driver does this because it can cover huge memory areas.)

    But the x86 conversion to the generic GUP code changed the get_page() to
    page_cache_get_speculative() which is faster but doesn't do the
    get_zone_device_page() call the pmem code relies on.

    One way to solve the regression would be to change the generic GUP code to use
    get_page(), but that would slow things down a bit and punish other generic-GUP
    using architectures for an x86-ism they did not care about. (Arguably the pmem
    driver was probably not working reliably for them: but nvdimm is an Intel
    feature, so non-x86 exposure is probably still limited.)

    So restructure the pmem code's interface with the MM instead: get rid of the
    get/put_zone_device_page() distinction, integrate put_zone_device_page() into
    __put_page() and restructure the pmem completion-wait and teardown machinery:

    Kirill points out that the calls to {get,put}_dev_pagemap() can be
    removed from the mm fast path if we take a single get_dev_pagemap()
    reference to signify that the page is alive and use the final put of the
    page to drop that reference.

    This does require some care to make sure that any waits for the
    percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
    since it now maintains its own elevated reference.

    This speeds up things while also making the pmem refcounting more robust going
    forward.

    Suggested-by: Kirill Shutemov
    Tested-by: Kirill Shutemov
    Signed-off-by: Dan Williams
    Reviewed-by: Logan Gunthorpe
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Jérôme Glisse
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar

    Dan Williams
     

30 Apr, 2017

1 commit

  • Toshi noticed that the new support for a region-level badblocks missed
    the case where errors are cleared due to BTT I/O.

    An initial attempt to fix this ran into a "sleeping while atomic"
    warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
    satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
    However, that lock is not needed since we are not acting on any data that
    is subject to change under that lock. The badblocks instance has its own
    internal lock to handle mutations of the error list.

    So, in order to make it clear that we are just acting on region devices,
    rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
    Eliminate the lock and consolidate all support routines for the new
    nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, take the
    opportunity to clean up some unnecessary casts, make the calling
    convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
    resource with the minimal struct clear_badblocks_context, and use the
    DEVICE_ATTR macro.

    Cc: Dave Jiang
    Cc: Vishal Verma
    Reported-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     

29 Apr, 2017

3 commits

  • ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length
    of error actually cleared, which may be smaller than its requested
    'len'.

    Change nvdimm_clear_poison() to call nvdimm_forget_poison() with
    'clear_err.cleared' when this value is valid.

    Cc:
    Fixes: e046114af5fc ("libnvdimm: clear the internal poison_list when clearing badblocks")
    Cc: Dave Jiang
    Cc: Vishal Verma
    Signed-off-by: Toshi Kani
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • The following BUG was observed when nd_pmem_notify() was called
    for a BTT device. The use of a pmem_device pointer is not valid
    with BTT.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
    IP: nd_pmem_notify+0x30/0xf0 [nd_pmem]
    Call Trace:
    nd_device_notify+0x40/0x50
    child_notify+0x10/0x20
    device_for_each_child+0x50/0x90
    nd_region_notify+0x20/0x30
    nd_device_notify+0x40/0x50
    nvdimm_region_notify+0x27/0x30
    acpi_nfit_scrub+0x341/0x590 [nfit]
    process_one_work+0x197/0x450
    worker_thread+0x4e/0x4a0
    kthread+0x109/0x140

    Fix nd_pmem_notify() by setting nd_region and badblocks pointers
    properly for BTT.

    Cc:
    Cc: Vishal Verma
    Fixes: 719994660c24 ("libnvdimm: async notification support")
    Signed-off-by: Toshi Kani
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • The nvdimm_flush() mechanism helps to reduce the impact of an ADR
    (asynchronous-DRAM-refresh) failure. The ADR mechanism handles flushing
    platform WPQ (write-pending-queue) buffers when power is removed. The
    nvdimm_flush() mechanism performs that same function on-demand.

    When a pmem namespace is associated with a block device, an
    nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
    request. These requests are typically associated with filesystem
    metadata updates. However, when a namespace is in device-dax mode,
    userspace (think database metadata) needs another path to perform the
    same flushing. In other words this is not required to make data
    persistent, but in the case of metadata it allows for a smaller failure
    domain in the unlikely event of an ADR failure.

    The new 'deep_flush' attribute is visible when the individual DIMMs
    backing a given interleave-set are described by platform firmware. In
    ACPI terms this is "NVDIMM Region Mapping Structures" and associated
    "Flush Hint Address Structures". Reads return "1" if the region supports
    triggering WPQ flushes on all DIMMs. Reads return "0" if the flush
    operation is a platform nop, and in that case the attribute is
    read-only.

    Why sysfs and not an ioctl? An ioctl requires establishing a new
    ioctl function number space for device-dax. Given that this would be
    called on a device-dax fd an application could be forgiven for
    accidentally calling this on a filesystem-dax fd. Placing this interface
    in libnvdimm sysfs removes that potential for collision with a
    filesystem ioctl, and it keeps ioctls out of the generic device-dax
    implementation.

    Cc: Jeff Moyer
    Cc: Masayoshi Mizuma
    Signed-off-by: Dan Williams

    Dan Williams
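
For the 'deep_flush' attribute described in the commit above, a userspace reader might look like the sketch below. The function name sketch_read_deep_flush() is hypothetical, and the attribute path is supplied by the caller; a path such as /sys/bus/nd/devices/region0/deep_flush is an assumption about where the region attribute appears on a given system. The "1" and "0" read values are the ones stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read a deep_flush-style attribute file.
 * Returns 1 if the region reports that WPQ flushes can be triggered,
 * 0 if the flush is a platform nop, and -1 on any read or parse error.
 */
static int sketch_read_deep_flush(const char *attr_path)
{
    FILE *fp;
    int value = -1;

    assert(attr_path != NULL);
    fp = fopen(attr_path, "r");
    if (!fp)
        return -1;
    if (fscanf(fp, "%d", &value) != 1 || (value != 0 && value != 1))
        value = -1;
    fclose(fp);
    return value;
}
```

A caller would check for a return value of 1 before attempting to trigger a flush, since the commit notes the attribute is read-only when the flush is a nop.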
     

28 Apr, 2017

1 commit


26 Apr, 2017

2 commits

  • memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
    serves no real benefit aside from affording a more generic function name
    than the x86-specific 'mcsafe'. However this would not be the first time
    that x86 terminology leaked into the global namespace. For lack of
    better name, just use memcpy_mcsafe() directly.

    This conversion also catches a place where we should have been using
    plain memcpy, acpi_nfit_blk_single_io().

    Cc:
    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Acked-by: Tony Luck
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Now that all the producers and consumers of dax interfaces have been
    converted to using dax_operations on a dax_device, remove the block
    device direct_access enabling.

    Signed-off-by: Dan Williams

    Dan Williams
     

25 Apr, 2017

1 commit

  • In the case where a dimm does not have any associated flush hints the
    ndrd->flush_wpq array may be uninitialized leading to crashes with the
    following signature:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: region_visible+0x10f/0x160 [libnvdimm]

    Call Trace:
    internal_create_group+0xbe/0x2f0
    sysfs_create_groups+0x40/0x80
    device_add+0x2d8/0x650
    nd_async_device_register+0x12/0x40 [libnvdimm]
    async_run_entry_fn+0x39/0x170
    process_one_work+0x212/0x6c0
    ? process_one_work+0x197/0x6c0
    worker_thread+0x4e/0x4a0
    kthread+0x10c/0x140
    ? process_one_work+0x6c0/0x6c0
    ? kthread_create_on_node+0x60/0x60
    ret_from_fork+0x31/0x40

    Cc:
    Reviewed-by: Jeff Moyer
    Fixes: f284a4f23752 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
    Signed-off-by: Dan Williams

    Dan Williams
     

20 Apr, 2017

1 commit

  • Setup a dax_device to have the same lifetime as the pmem block device
    and add a ->direct_access() method that is equivalent to
    pmem_direct_access(). Once fs/dax.c has been converted to use
    dax_operations the old pmem_direct_access() will be removed.

    Signed-off-by: Dan Williams

    Dan Williams
     

15 Apr, 2017

1 commit

  • This reverts commit 4aa5615e080a "libnvdimm: band aid btt vs clear
    poison locking".

    Now that poison list locking has been converted to a spinlock and poison
    list entry allocation during i/o has been converted to GFP_NOWAIT,
    revert the band-aid that disabled error clearing from btt i/o.

    Cc: Vishal Verma
    Cc: Dave Jiang
    Signed-off-by: Dan Williams

    Dan Williams
     

14 Apr, 2017

1 commit

  • The following warning results from holding a lane spinlock,
    preempt_disable(), or the btt map spinlock and then trying to take the
    reconfig_mutex to walk the poison list and potentially add new entries.

    BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
    in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
    [..]
    Call Trace:
    dump_stack+0x85/0xc8
    ___might_sleep+0x184/0x250
    __might_sleep+0x4a/0x90
    __mutex_lock+0x58/0x9b0
    ? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
    ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
    ? acpi_nfit_forget_poison+0x79/0x80 [nfit]
    ? _raw_spin_unlock+0x27/0x40
    mutex_lock_nested+0x1b/0x20
    nvdimm_bus_lock+0x21/0x30 [libnvdimm]
    nvdimm_forget_poison+0x25/0x50 [libnvdimm]
    nvdimm_clear_poison+0x106/0x140 [libnvdimm]
    nsio_rw_bytes+0x164/0x270 [libnvdimm]
    btt_write_pg+0x1de/0x3e0 [nd_btt]
    ? blk_queue_enter+0x30/0x290
    btt_make_request+0x11a/0x310 [nd_btt]
    ? blk_queue_enter+0xb7/0x290
    ? blk_queue_enter+0x30/0x290
    generic_make_request+0x118/0x3b0

    A spinlock is introduced to protect the poison list. This means we no
    longer have to acquire the reconfig_mutex to touch the poison list. The
    add_poison() function has been broken out into two helper functions: one to
    allocate the poison entry and the other to append the entry. This allows us
    to unlock the poison_lock in the non-I/O path and continue to be able to
    allocate the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O
    path in order to satisfy being in atomic context.

    Reviewed-by: Vishal Verma
    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dave Jiang
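
The split described in the commit above (allocate the entry first, then append it under a short spinlock) can be sketched in userspace C. Everything here is an illustrative assumption: the sketch_* names are hypothetical, a C11 atomic_flag stands in for the kernel spinlock, and malloc() stands in for a kernel allocation, so the GFP_KERNEL versus GFP_NOWAIT distinction is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_poison {
    uint64_t start;
    uint64_t length;
    struct sketch_poison *next;
};

struct sketch_poison_list {
    atomic_flag lock;               /* stand-in for the poison spinlock */
    struct sketch_poison *head;
    struct sketch_poison *tail;
    size_t count;
};

static void sketch_poison_list_init(struct sketch_poison_list *list)
{
    atomic_flag_clear(&list->lock);
    list->head = NULL;
    list->tail = NULL;
    list->count = 0;
}

static void sketch_lock(struct sketch_poison_list *list)
{
    while (atomic_flag_test_and_set_explicit(&list->lock,
                                             memory_order_acquire))
        ; /* spin until the flag is released */
}

static void sketch_unlock(struct sketch_poison_list *list)
{
    atomic_flag_clear_explicit(&list->lock, memory_order_release);
}

/* Helper 1: allocate and fill an entry. The lock is not held here. */
static struct sketch_poison *sketch_alloc_poison(uint64_t start,
                                                 uint64_t length)
{
    struct sketch_poison *entry = malloc(sizeof(*entry));

    if (entry) {
        entry->start = start;
        entry->length = length;
        entry->next = NULL;
    }
    return entry;
}

/* Helper 2: append an entry. The caller must hold the lock. */
static void sketch_append_poison(struct sketch_poison_list *list,
                                 struct sketch_poison *entry)
{
    assert(entry != NULL);
    if (list->tail)
        list->tail->next = entry;
    else
        list->head = entry;
    list->tail = entry;
    list->count++;
}

/* Allocation happens outside the lock; only the append is locked. */
static int sketch_add_poison(struct sketch_poison_list *list,
                             uint64_t start, uint64_t length)
{
    struct sketch_poison *entry = sketch_alloc_poison(start, length);

    if (!entry)
        return -1;
    sketch_lock(list);
    sketch_append_poison(list, entry);
    sketch_unlock(list);
    return 0;
}
```

Keeping the allocation out of the locked region is what allows the locked section to stay non-sleeping, which is the property the I/O path needs.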
     

13 Apr, 2017

3 commits


11 Apr, 2017

2 commits

  • The following warning results from holding a lane spinlock,
    preempt_disable(), or the btt map spinlock and then trying to take the
    reconfig_mutex to walk the poison list and potentially add new entries.

    BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
    in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
    [..]
    Call Trace:
    dump_stack+0x85/0xc8
    ___might_sleep+0x184/0x250
    __might_sleep+0x4a/0x90
    __mutex_lock+0x58/0x9b0
    ? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
    ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
    ? acpi_nfit_forget_poison+0x79/0x80 [nfit]
    ? _raw_spin_unlock+0x27/0x40
    mutex_lock_nested+0x1b/0x20
    nvdimm_bus_lock+0x21/0x30 [libnvdimm]
    nvdimm_forget_poison+0x25/0x50 [libnvdimm]
    nvdimm_clear_poison+0x106/0x140 [libnvdimm]
    nsio_rw_bytes+0x164/0x270 [libnvdimm]
    btt_write_pg+0x1de/0x3e0 [nd_btt]
    ? blk_queue_enter+0x30/0x290
    btt_make_request+0x11a/0x310 [nd_btt]
    ? blk_queue_enter+0xb7/0x290
    ? blk_queue_enter+0x30/0x290
    generic_make_request+0x118/0x3b0

    As a minimal fix, disable error clearing when the BTT is enabled for the
    namespace. For the final fix a larger rework of the poison list locking
    is needed.

    Note that this is not a problem in the blk case since that path never
    calls nvdimm_clear_poison().

    Cc:
    Fixes: 82bf1037f2ca ("libnvdimm: check and clear poison before writing to pmem")
    Cc: Dave Jiang
    [jeff: dynamically disable error clearing in the btt case]
    Suggested-by: Jeff Moyer
    Reviewed-by: Jeff Moyer
    Reported-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Holding the reconfig_mutex over a potential userspace fault sets up a
    lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl
    path. Move the user access outside of the lock.

    [ INFO: possible circular locking dependency detected ]
    4.11.0-rc3+ #13 Tainted: G W O
    -------------------------------------------------------
    fallocate/16656 is trying to acquire lock:
    (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [] nvdimm_bus_lock+0x21/0x30 [libnvdimm]
    but task is already holding lock:
    (jbd2_handle){++++..}, at: [] start_this_handle+0x104/0x460

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (jbd2_handle){++++..}:
    lock_acquire+0xbd/0x200
    start_this_handle+0x16a/0x460
    jbd2__journal_start+0xe9/0x2d0
    __ext4_journal_start_sb+0x89/0x1c0
    ext4_dirty_inode+0x32/0x70
    __mark_inode_dirty+0x235/0x670
    generic_update_time+0x87/0xd0
    touch_atime+0xa9/0xd0
    ext4_file_mmap+0x90/0xb0
    mmap_region+0x370/0x5b0
    do_mmap+0x415/0x4f0
    vm_mmap_pgoff+0xd7/0x120
    SyS_mmap_pgoff+0x1c5/0x290
    SyS_mmap+0x22/0x30
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    -> #1 (&mm->mmap_sem){++++++}:
    lock_acquire+0xbd/0x200
    __might_fault+0x70/0xa0
    __nd_ioctl+0x683/0x720 [libnvdimm]
    nvdimm_ioctl+0x8b/0xe0 [libnvdimm]
    do_vfs_ioctl+0xa8/0x740
    SyS_ioctl+0x79/0x90
    do_syscall_64+0x6c/0x200
    return_from_SYSCALL_64+0x0/0x7a

    -> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
    __lock_acquire+0x16b6/0x1730
    lock_acquire+0xbd/0x200
    __mutex_lock+0x88/0x9b0
    mutex_lock_nested+0x1b/0x20
    nvdimm_bus_lock+0x21/0x30 [libnvdimm]
    nvdimm_forget_poison+0x25/0x50 [libnvdimm]
    nvdimm_clear_poison+0x106/0x140 [libnvdimm]
    pmem_do_bvec+0x1c2/0x2b0 [nd_pmem]
    pmem_make_request+0xf9/0x270 [nd_pmem]
    generic_make_request+0x118/0x3b0
    submit_bio+0x75/0x150

    Cc:
    Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
    Cc: Dave Jiang
    Reported-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dan Williams
     

05 Apr, 2017

1 commit

  • Commit a1f3e4d6a0c3 "libnvdimm, region: update nd_region_available_dpa()
    for multi-pmem support" reworked blk dpa (DIMM Physical Address)
    accounting to comprehend multiple pmem namespace allocations aliasing
    with a given blk-dpa range.

    The following call trace is a result of failing to account for allocated
    blk capacity.

    WARNING: CPU: 1 PID: 2433 at tools/testing/nvdimm/../../../drivers/nvdimm/names
    4 size_store+0x6f3/0x930 [libnvdimm]
    nd_region region5: allocation underrun: 0x0 of 0x1000000 bytes
    [..]
    Call Trace:
    dump_stack+0x86/0xc3
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x5f/0x80
    size_store+0x6f3/0x930 [libnvdimm]
    dev_attr_store+0x18/0x30

    If a given blk-dpa allocation does not alias with any pmem ranges then
    the full allocation should be accounted as busy space, not the size of
    the current pmem contribution to the region.

    The thinko that led to this confusion was not realizing that struct
    resource management already guarantees no collisions between pmem
    allocations and blk allocations on the same DIMM. Also, we do not try to
    support blk allocations in aliased pmem holes.

    This patch also fixes a case where the available blk goes negative.

    Cc:
    Fixes: a1f3e4d6a0c3 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support")
    Reported-by: Dariusz Dokupil
    Reported-by: Dave Jiang
    Reported-by: Vishal Verma
    Tested-by: Dave Jiang
    Tested-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dan Williams
     

01 Mar, 2017

1 commit

  • The interleave-set cookie is a checksum that sanity checks that the
    composition of an interleave set has not changed since the namespace
    was initially created. The checksum is calculated by sorting the DIMMs
    by their location in the interleave-set. The comparison for the sort
    must be 64-bit wide, not byte-by-byte as performed by memcmp() in the
    broken case.
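
    As a hedged illustration of why the two comparisons disagree, the
    sketch below sorts a hypothetical struct member by a single 64-bit
    field, once with memcmp() and once with a 64-bit wide compare. The
    struct, its region_offset field, and all function names are
    assumptions made for this example; they are not the kernel's code.
    On a little-endian machine memcmp() looks at the least significant
    byte first, so 0x0100 sorts ahead of 0x00ff.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one interleave-set member (not the kernel struct). */
struct member {
    uint64_t region_offset;
};

/* Byte-by-byte ordering, as memcmp() performs it: the broken case. */
static int cmp_bytes(const void *a, const void *b)
{
    return memcmp(a, b, sizeof(struct member));
}

/* 64-bit wide ordering: compares the values numerically. */
static int cmp_u64(const void *a, const void *b)
{
    const struct member *ma = a;
    const struct member *mb = b;

    if (ma->region_offset < mb->region_offset)
        return -1;
    if (ma->region_offset > mb->region_offset)
        return 1;
    return 0;
}

/* Sort n members; wide != 0 selects the numeric 64-bit comparison. */
static void sort_members(struct member *m, size_t n, int wide)
{
    qsort(m, n, sizeof(*m), wide ? cmp_u64 : cmp_bytes);
}
```

    Calling sort_members(m, n, 1) gives the numeric order on any
    platform, while the memcmp() variant only matches it where byte
    order and numeric order happen to coincide.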

    Fix the implementation to accept correct cookie values in addition to
    the Linux "memcmp" order cookies, but only allow correct cookies to be
    generated going forward. It does mean that namespaces created by
    third-party tooling, or created by newer kernels with this fix, will not
    validate on older kernels. However, there are a couple of mitigating
    conditions:

    1/ platforms with namespace-label capable NVDIMMs are not widely
    available.

    2/ interleave-sets with a single-dimm are by definition not affected
    (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.

    The cookie stored in the namespace label will be fixed by any write to
    the namespace label. The most straightforward way to achieve this is to
    write to the "alt_name" attribute of a namespace in sysfs.

    Cc:
    Fixes: eaf961536e16 ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
    Reported-by: Nicholas Moulin
    Tested-by: Nicholas Moulin
    Signed-off-by: Dan Williams

    Dan Williams
     

05 Feb, 2017

1 commit

  • When vmemmap_populate() allocates space for the memmap it does so in 2MB
    sized chunks. The libnvdimm-pfn driver incorrectly accounts for this
    when the alignment of the device is set to 4K. When this happens,
    memory allocations in altmap_alloc_block_buf() fail and we see
    warnings of the form:

    WARNING: CPU: 0 PID: 3376 at arch/x86/mm/init_64.c:656 arch_add_memory+0xe4/0xf0
    [..]
    Call Trace:
    dump_stack+0x86/0xc3
    __warn+0xcb/0xf0
    warn_slowpath_null+0x1d/0x20
    arch_add_memory+0xe4/0xf0
    devm_memremap_pages+0x29b/0x4e0
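
    The accounting gap can be sketched with plain arithmetic. This is a
    minimal illustration, not the driver's code: the 64-byte per-page
    metadata size, the 4K page size, and the helper names are all
    assumptions. It shows that a reservation rounded up only to the 4K
    device alignment can be smaller than one rounded up to the 2MB
    chunks in which the memmap is populated.

```c
#include <stdint.h>

#define SZ_4K 0x1000ULL
#define SZ_2M 0x200000ULL
#define PAGE_STRUCT_SIZE 64ULL /* assumed size of per-page metadata */

/* Round x up to a power-of-two boundary. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/*
 * Bytes to reserve for the memmap of a device of dev_size bytes when
 * the reservation is rounded up to multiples of chunk bytes.
 */
static uint64_t memmap_reserve(uint64_t dev_size, uint64_t chunk)
{
    uint64_t npfns = dev_size / SZ_4K;

    return round_up_pow2(npfns * PAGE_STRUCT_SIZE, chunk);
}
```

    For a device of 1GiB plus one 4K page, rounding to 4K reserves
    16781312 bytes while rounding to 2MB reserves 18874368 bytes; the
    difference is the space a 2MB-chunk allocator would overrun.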

    Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE")
    Cc:
    Signed-off-by: Dan Williams

    Dan Williams
     

01 Feb, 2017

2 commits

  • Given that the naming of pmem devices changes from the pmemX form to the
    pmemX.Y form when namespace id is greater than 0, arrange for namespaces
    with id-0 to be exempt from deletion. Otherwise a simple reconfiguration
    of an existing namespace to a new mode results in a name change of the
    resulting block device:

    # ndctl list --namespace=namespace1.0
    {
    "dev":"namespace1.0",
    "mode":"raw",
    "size":2147483648,
    "uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3",
    "blockdev":"pmem1"
    }

    # ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force
    {
    "dev":"namespace1.1",
    "mode":"memory",
    "size":2111832064,
    "uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613",
    "blockdev":"pmem1.1"
    }

    This change does require tooling changes to explicitly look for
    namespaceX.0 if the seed has already advanced to another namespace.
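
    The naming rule described above can be sketched as follows. The
    function name and its parameters are hypothetical and exist only to
    illustrate the pmemX versus pmemX.Y forms; this is not the driver's
    actual naming code.

```c
#include <stdio.h>

/*
 * Format a pmem block device name: "pmemX" for namespace id 0 and
 * "pmemX.Y" for any namespace id greater than 0.
 * Returns the value returned by snprintf().
 */
static int pmem_disk_name(char *buf, size_t len, int region_id, int ns_id)
{
    if (ns_id > 0)
        return snprintf(buf, len, "pmem%d.%d", region_id, ns_id);
    return snprintf(buf, len, "pmem%d", region_id);
}
```

    With region 1, namespace id 0 yields "pmem1" and namespace id 1
    yields "pmem1.1", matching the ndctl output shown above.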

    Cc:
    Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Declare device_type structures as const, as they are only stored in
    the type field of a device structure. This field is a pointer to
    const, so add const to the declaration of the device_type structures.

    File size before:
       text    data     bss     dec     hex filename
      19278    3199      16   22493    57dd nvdimm/namespace_devs.o

    File size after:
       text    data     bss     dec     hex filename
      19929    3160      16   23105    5a41 nvdimm/namespace_devs.o
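
    The pattern can be sketched with trimmed-down stand-ins for the
    kernel structures. Everything below (the reduced struct layouts,
    example_type, device_set_type) is hypothetical and only illustrates
    why a const declaration is compatible with a pointer-to-const field.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for the kernel's struct device_type. */
struct device_type {
    const char *name;
};

/* Hypothetical, reduced stand-in for struct device. */
struct device {
    const struct device_type *type; /* pointer to const: never written through */
};

/* Because the type is only ever read, the object itself can be const. */
static const struct device_type example_type = {
    .name = "example",
};

/* Store a device_type in a device; accepts const objects without a cast. */
static void device_set_type(struct device *dev, const struct device_type *t)
{
    dev->type = t;
}
```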

    Signed-off-by: Bhumika Goyal
    Signed-off-by: Dan Williams

    Bhumika Goyal
     

14 Jan, 2017

1 commit

  • Commit 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple
    pmem-namespaces per region") added support for establishing additional
    pmem namespaces beyond the seed device, similar to blk namespaces.
    However, it neglected to delete the namespace when the size is set to
    zero.

    Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
    Cc:
    Signed-off-by: Dan Williams

    Dan Williams
     

13 Jan, 2017

1 commit

  • The read_pmem() function uses memcpy_mcsafe() on x86 where an EFAULT
    error code indicates a failed read. Block I/O should use EIO to
    indicate failure. Other pmem code paths (like bad blocks) already use
    EIO so let's be consistent.

    This fixes compatibility with consumers like btrfs that try to parse the
    specific error code rather than treat all errors the same.
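
    A minimal sketch of the error translation, assuming a stub in place
    of memcpy_mcsafe(): copy_mcsafe_stub() and read_pmem_sketch() are
    hypothetical names used only for illustration, and the real
    function takes buffers and lengths that are omitted here.

```c
#include <errno.h>

/* Hypothetical stub for a machine-check-safe copy: 0 or -EFAULT. */
static int copy_mcsafe_stub(int poisoned)
{
    return poisoned ? -EFAULT : 0;
}

/* Report any failed read to the block layer as -EIO, not -EFAULT. */
static int read_pmem_sketch(int poisoned)
{
    int rc = copy_mcsafe_stub(poisoned);

    if (rc)
        return -EIO;
    return 0;
}
```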

    Reviewed-by: Jeff Moyer
    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Dan Williams

    Stefan Hajnoczi