15 Feb, 2017

2 commits

  • commit bfb34527a32a1a576d9bfb7026d3ab0369a6cd60 upstream.

    When vmemmap_populate() allocates space for the memmap it does so in 2MB
    sized chunks. The libnvdimm-pfn driver incorrectly accounts for this
    when the alignment of the device is set to 4K. When this happens we
    trigger memory allocation failures in altmap_alloc_block_buf() and
    trigger warnings of the form:

    WARNING: CPU: 0 PID: 3376 at arch/x86/mm/init_64.c:656 arch_add_memory+0xe4/0xf0
    [..]
    Call Trace:
    dump_stack+0x86/0xc3
    __warn+0xcb/0xf0
    warn_slowpath_null+0x1d/0x20
    arch_add_memory+0xe4/0xf0
    devm_memremap_pages+0x29b/0x4e0

    Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE")
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
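    As a rough illustration of the accounting problem, the sketch below compares a reservation rounded to the device alignment with one rounded to the 2MB granularity that vmemmap_populate() allocates in. It assumes a 64-byte struct page and an 8K info-block area in front of the memmap. The helper names are hypothetical and this is not the driver's actual code.

```c
#include <stdint.h>

#define SZ_4K (4ULL << 10)
#define SZ_8K (8ULL << 10)
#define SZ_2M (2ULL << 20)
/* Assumption: sizeof(struct page) == 64 on the target architecture. */
#define MEMMAP_BYTES_PER_PAGE 64ULL

/* Round x up to a multiple of a (a must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Hypothetical: info block plus memmap, rounded to the device alignment. */
uint64_t memmap_reserve(uint64_t npfns, uint64_t align)
{
	return align_up(SZ_8K + MEMMAP_BYTES_PER_PAGE * npfns, align);
}

/*
 * Hypothetical fix: never round to less than 2MB, since the memmap is
 * populated in 2MB chunks even when the device alignment is 4K.
 */
uint64_t memmap_reserve_fixed(uint64_t npfns, uint64_t align)
{
	return memmap_reserve(npfns, align > SZ_2M ? align : SZ_2M);
}
```

    For a 1GB namespace of 4K pages (262144 PFNs), the 4K-aligned reservation is 16785408 bytes, which is not a whole number of 2MB chunks, while the 2MB-rounded one is 18874368 bytes.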
     
  • commit 9d032f4201d39e5cf43a8709a047e481f5723fdc upstream.

    Given that the naming of pmem devices changes from the pmemX form to the
    pmemX.Y form when namespace id is greater than 0, arrange for namespaces
    with id-0 to be exempt from deletion. Otherwise a simple reconfiguration
    of an existing namespace to a new mode results in a name change of the
    resulting block device:

    # ndctl list --namespace=namespace1.0
    {
    "dev":"namespace1.0",
    "mode":"raw",
    "size":2147483648,
    "uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3",
    "blockdev":"pmem1"
    }

    # ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force
    {
    "dev":"namespace1.1",
    "mode":"memory",
    "size":2111832064,
    "uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613",
    "blockdev":"pmem1.1"
    }

    This change does require tooling changes to explicitly look for
    namespaceX.0 if the seed has already advanced to another namespace.

    Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
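    The deletion rule described above can be sketched as a simple predicate: a namespace whose size drops to zero is deleted, except for id 0, so that the pmemX name survives reconfiguration. The struct and function names are hypothetical and only model the rule, not the driver code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a pmem namespace device. */
struct ns_model {
	int id;         /* namespace id: Y in namespaceX.Y */
	uint64_t size;  /* current size in bytes */
};

/*
 * A namespace shrunk to zero is a candidate for deletion, but id 0 is
 * exempt so the block device keeps its pmemX name.
 */
bool ns_should_delete(const struct ns_model *ns)
{
	return ns->size == 0 && ns->id != 0;
}
```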
     

26 Jan, 2017

1 commit

  • commit 1f19b983a8877f81763fab3e693c6befe212736d upstream.

    Commit 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple
    pmem-namespaces per region") added support for establishing additional
    pmem namespaces beyond the seed device, similar to blk namespaces.
    However, it neglected to delete the namespace when the size is set to
    zero.

    Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

09 Jan, 2017

1 commit

  • commit af7d9f0c57941b465043681cb5c3410f7f3f1a41 upstream.

    Fix the format specifier so that the attribute can be parsed correctly.
    Currently it returns "1000" (4096 printed in hexadecimal) for a
    4096-byte alignment, which parses as decimal 1000.

    Reported-by: Dave Jiang
    Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE")
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
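    The symptom is easy to reproduce in userspace. This sketch assumes, for illustration, that the buggy attribute printed the value with a hex specifier and no "0x" prefix; the function names are hypothetical.

```c
#include <stdio.h>

/* Hex without a prefix: 4096 is printed as "1000". */
int align_show_hex(char *buf, size_t len, unsigned long align)
{
	return snprintf(buf, len, "%lx\n", align);
}

/* Decimal: a base-10 parser reads back 4096. */
int align_show_dec(char *buf, size_t len, unsigned long align)
{
	return snprintf(buf, len, "%lu\n", align);
}
```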
     

07 Dec, 2016

1 commit

  • Given ambiguities in the ACPI 6.1 definition of the "Output (Size)"
    field of the ARS (Address Range Scrub) Status command, a firmware
    implementation may in practice return 0, 4, or 8 to indicate that there
    is no output payload to process.

    The specification states "Size of Output Buffer in bytes, including this
    field.". However, 'Output Buffer' is also the name of the entire
    payload, and earlier in the specification it states "Max Query ARS
    Status Output Buffer Size: Maximum size of buffer (including the Status
    and Extended Status fields)".

    Without this fix, if the BIOS happens to return 0, it causes memory
    corruption, as evidenced by this result from the acpi_nfit_ctl() unit
    test:

    ars_status00000000: 00020000 00000000 ........
    BUG: stack guard page was hit at ffffc90001750000 (stack is ffffc9000174c000..ffffc9000174ffff)
    kernel stack overflow (page fault): 0000 [#1] SMP DEBUG_PAGEALLOC
    task: ffff8803332d2ec0 task.stack: ffffc9000174c000
    RIP: 0010:[] [] __memcpy+0x12/0x20
    RSP: 0018:ffffc9000174f9a8 EFLAGS: 00010246
    RAX: ffffc9000174fab8 RBX: 0000000000000000 RCX: 000000001fffff56
    RDX: 0000000000000000 RSI: ffff8803231f5a08 RDI: ffffc90001750000
    RBP: ffffc9000174fa88 R08: ffffc9000174fab0 R09: ffff8803231f54b8
    R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000003 R15: ffff8803231f54a0
    FS: 00007f3a611af640(0000) GS:ffff88033ed00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90001750000 CR3: 0000000325b20000 CR4: 00000000000406e0
    Stack:
    ffffffffa00bc60d 0000000000000008 ffffc90000000001 ffffc9000174faac
    0000000000000292 ffffffffa00c24e4 ffffffffa00c2914 0000000000000000
    0000000000000000 ffffffff00000003 ffff880331ae8ad0 0000000800000246
    Call Trace:
    [] ? acpi_nfit_ctl+0x49d/0x750 [nfit]
    [] nfit_test_probe+0x670/0xb1b [nfit_test]

    Cc:
    Fixes: 747ffe11b440 ("libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing")
    Signed-off-by: Dan Williams

    Dan Williams
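    One way a reported size of 0 can turn into an enormous copy length is unsigned wraparound when a fixed header size is subtracted. The sketch below assumes an 8-byte header (4 bytes of Status/Extended Status plus the 4-byte Output (Size) field) purely for illustration; the helper names are hypothetical and do not reproduce acpi_nfit_ctl().

```c
#include <stdint.h>

/* Assumption: 8 bytes of header precede the ARS records. */
#define ARS_STATUS_HDR 8u

/* Naive: trusts firmware, so 0 - 8 wraps to 0xfffffff8. */
uint32_t ars_payload_naive(uint32_t out_size)
{
	return out_size - ARS_STATUS_HDR;
}

/*
 * Defensive: firmware may report 0, 4 or 8 for "no payload", so any
 * value that does not exceed the header is treated as empty.
 */
uint32_t ars_payload_checked(uint32_t out_size)
{
	if (out_size <= ARS_STATUS_HDR)
		return 0;
	return out_size - ARS_STATUS_HDR;
}
```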
     

28 Oct, 2016

1 commit

  • A bugfix just tried to address a randconfig build problem and introduced
    a variant of the same problem: with CONFIG_LIBNVDIMM=y and
    CONFIG_NVDIMM_DAX=m, the nvdimm module now fails to link:

    drivers/nvdimm/built-in.o: In function `to_nd_device_type':
    bus.c:(.text+0x1b5d): undefined reference to `is_nd_dax'
    drivers/nvdimm/built-in.o: In function `nd_region_notify_driver_action.constprop.2':
    region_devs.c:(.text+0x6b6c): undefined reference to `is_nd_dax'
    region_devs.c:(.text+0x6b8c): undefined reference to `to_nd_dax'
    drivers/nvdimm/built-in.o: In function `nd_region_probe':
    region.c:(.text+0x70f3): undefined reference to `nd_dax_create'
    drivers/nvdimm/built-in.o: In function `mode_show':
    namespace_devs.c:(.text+0xa196): undefined reference to `is_nd_dax'
    drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe':
    (.text+0xa55f): undefined reference to `is_nd_dax'
    drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe':
    (.text+0xa56e): undefined reference to `to_nd_dax'

    This reverts the earlier fix, making NVDIMM_DAX a 'bool' option again
    as it should be (it gets linked into the libnvdimm module). To fix
    the original problem, I'm adding a dependency on LIBNVDIMM to
    DEV_DAX_PMEM, which ensures we can't have that one built-in if the
    rest is a module.

    Fixes: 4e65e9381c7a ("/dev/dax: fix Kconfig dependency build breakage")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Arnd Bergmann
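    The dependency logic can be modelled with Kconfig's tristate rules: "&&" takes the minimum of n < m < y, and a bool symbol whose dependency evaluates to m may still be set to y. This is a simplified model, not kconfig itself, and it assumes NVDIMM_DAX's own dependency evaluates to m when LIBNVDIMM=m.

```c
/* Kconfig tristate values, ordered so that n < m < y. */
enum tri { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/* "A && B" on tristates evaluates to the smaller value. */
enum tri tri_and(enum tri a, enum tri b)
{
	return a < b ? a : b;
}

/*
 * Highest value a symbol may take given its "depends on" result.
 * A bool whose dependency is m can still be y, which is how
 * NVDIMM_DAX=y can coexist with LIBNVDIMM=m.
 */
enum tri tri_limit(enum tri dep, int is_bool)
{
	if (is_bool && dep == TRI_M)
		return TRI_Y;
	return dep;
}
```

    With only "depends on NVDIMM_DAX", DEV_DAX_PMEM can reach y; adding LIBNVDIMM to the expression caps it at m.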
     

20 Oct, 2016

2 commits

  • The ACPI Clear Uncorrectable Error DSM function may fail or may be
    unsupported on a platform. In such cases pmem_clear_poison() returns
    without clearing badblocks, and the failure is only detected at the
    next read (-EIO).

    This behavior can lead to an issue when a user keeps writing but does
    not read immediately. For instance, a flight recorder file may only be
    read when it is needed for troubleshooting.

    Change pmem_do_bvec() and pmem_clear_poison() to return -EIO so that
    the filesystem can log an error message on a write error.

    Cc: Vishal Verma
    Signed-off-by: Toshi Kani
    Signed-off-by: Dan Williams

    Toshi Kani
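    The behavioural change amounts to turning a short clear into a write error. In this hypothetical model, "cleared" is the number of bytes the platform reports as cleared when a write covers a known bad block (0 if the DSM fails or is unsupported).

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of the write path: a failed or partial poison
 * clear is reported as -EIO instead of being silently ignored.
 */
int pmem_write_model(size_t len, size_t cleared)
{
	if (cleared < len)
		return -EIO;
	return 0;
}
```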
     
  • If the kcalloc() fails then "devs" can be NULL and we dereference it
    checking "devs[i]".

    Fixes: 1b40e09a1232 ('libnvdimm: blk labels and namespace instantiation')
    Signed-off-by: Dan Carpenter
    Signed-off-by: Dan Williams

    Dan Carpenter
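    The pattern is to test the array allocation before touching any element. In this hypothetical sketch, "alloc" stands in for kcalloc() so a failure can be simulated; it must zero memory and pair with free(), as calloc() does.

```c
#include <stddef.h>
#include <stdlib.h>

struct dev_model { int id; };

/* Stand-in allocator that always fails, to exercise the error path. */
static void *fail_alloc(size_t n, size_t size)
{
	(void)n;
	(void)size;
	return NULL;
}

struct dev_model **create_devs(size_t count, void *(*alloc)(size_t, size_t))
{
	/* One extra zeroed slot keeps the array NULL-terminated. */
	struct dev_model **devs = alloc(count + 1, sizeof(*devs));
	size_t i;

	if (!devs)
		return NULL;	/* check before any devs[i] access */
	for (i = 0; i < count; i++) {
		devs[i] = alloc(1, sizeof(**devs));
		if (!devs[i]) {
			while (i--)
				free(devs[i]);
			free(devs);
			return NULL;
		}
		devs[i]->id = (int)i;
	}
	return devs;
}
```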
     

08 Oct, 2016

12 commits

  • Dan Williams
     
  • Dan Williams
     
  • The function dax_pmem_probe() in drivers/dax/pmem.c is compiled under the
    CONFIG_DEV_DAX_PMEM tri-state config option. This config option currently
    only depends on CONFIG_NVDIMM_DAX, a bool, which means that the following
    configuration is possible:

    CONFIG_LIBNVDIMM=m
    ...
    CONFIG_NVDIMM_DAX=y
    CONFIG_DEV_DAX=y
    CONFIG_DEV_DAX_PMEM=y

    With this config, LIBNVDIMM is compiled as a module and NVDIMM_DAX=y just
    means that drivers/nvdimm/dax_devs.c is compiled into that module.
    However, dax_pmem_probe() depends on several symbols defined in
    drivers/nvdimm/dax_devs.c, which results in the following build errors:

    drivers/built-in.o: In function `dax_pmem_probe':
    linux/drivers/dax/pmem.c:70: undefined reference to `to_nd_dax'
    linux/drivers/dax/pmem.c:74: undefined reference to `nvdimm_namespace_common_probe'
    linux/drivers/dax/pmem.c:80: undefined reference to `devm_nsio_enable'
    linux/drivers/dax/pmem.c:81: undefined reference to `nvdimm_setup_pfn'
    linux/drivers/dax/pmem.c:84: undefined reference to `devm_nsio_disable'
    linux/drivers/dax/pmem.c:122: undefined reference to `to_nd_region'
    drivers/built-in.o: In function `dax_pmem_init':
    linux/drivers/dax/pmem.c:147: undefined reference to `__nd_driver_register'

    Fix this by making NVDIMM_DAX a tristate. DEV_DAX_PMEM depends on
    NVDIMM_DAX which depends on LIBNVDIMM. Since they are all now tristates,
    if LIBNVDIMM is built as a kernel module DEV_DAX_PMEM will be as well.
    This prevents dax_devs.c from being built as a built-in while its
    dependencies are in the libnvdimm.ko module.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Ross Zwisler
     
  • Similar to BLK regions, publish new seed namespace devices to allow
    unused PMEM region capacity to be consumed by additional namespaces.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Now that the rest of the infrastructure has been converted to handle
    multi-pmem configurations, lift the artificial barrier at scan time.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Short-circuit doomed-to-fail label validation attempts by skipping
    labels that are outside the given region. For example a DIMM that has
    multiple PMEM regions will waste time attempting to create namespaces
    only to find that the interleave-set-cookie does not validate, e.g.:

    nd_region region6: invalid cookie in label: 73e608dc-47b9-4b2a-b5c7-2d55a32e0c2

    Similar to how we skip BLK labels when performing PMEM validation we can
    skip out-of-range labels early.

    Signed-off-by: Dan Williams

    Dan Williams
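    The early skip reduces to a range check of the label's DPA extent against the region's mapping on that DIMM. The struct and function names are hypothetical; the check is written to avoid overflow in "dpa + rawsize".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: the DPA range a region occupies on one DIMM. */
struct mapping_model {
	uint64_t start;
	uint64_t size;
};

/* True only if [dpa, dpa + rawsize) lies entirely inside the mapping. */
bool label_in_mapping(const struct mapping_model *m, uint64_t dpa,
		      uint64_t rawsize)
{
	uint64_t map_end = m->start + m->size;

	return dpa >= m->start && rawsize <= m->size &&
	       dpa <= map_end - rawsize;
}
```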
     
  • Now that we have nd_region_available_dpa() able to handle the presence
    of multiple PMEM allocations in aliased PMEM regions, reuse that same
    infrastructure to track allocations from free space. In particular
    handle allocating from an aliased PMEM region in the case where there
    are discontiguous holes. The allocation rules for BLK and PMEM are
    documented in the space_valid() helper:

    BLK-space is valid as long as it does not precede a PMEM
    allocation in a given region. PMEM-space must be contiguous
    and adjacent to an existing allocation (if one
    exists).

    Signed-off-by: Dan Williams

    Dan Williams
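    The two quoted rules can be modelled for a candidate free range starting at "free_start". This is a hypothetical simplification, not the space_valid() helper itself; it assumes "pmem_end" is the end of the highest PMEM allocation (0 if none).

```c
#include <stdbool.h>
#include <stdint.h>

/* BLK space must not lie below the end of any PMEM allocation. */
bool blk_space_valid(uint64_t free_start, uint64_t pmem_end)
{
	return free_start >= pmem_end;
}

/*
 * PMEM space is valid if no PMEM allocation exists yet, or if it starts
 * exactly where the existing allocation ends (contiguous and adjacent).
 */
bool pmem_space_valid(uint64_t free_start, bool have_pmem, uint64_t pmem_end)
{
	return !have_pmem || free_start == pmem_end;
}
```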
     
  • Instead of assuming that there will only ever be one allocated range at
    the start of the region, account for additional namespaces that might
    start at an offset from the region base.

    After this change pmem namespaces now have a reason to carry an array of
    resources similar to blk. Unifying the resource tracking infrastructure
    in nd_namespace_common is a future cleanup candidate.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • pmem devices are currently named /dev/pmemX, where X is the region id.
    Preserve the naming of the 0th device, but add a ".Y" suffix (Y being
    the namespace id) for other devices.

    Signed-off-by: Dan Williams

    Dan Williams
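    The naming scheme can be sketched as a small formatting helper, matching the pmem1 / pmem1.1 names shown in the ndctl output elsewhere in this log. The function name is hypothetical.

```c
#include <stdio.h>

/*
 * Namespace 0 of region X keeps the legacy "pmemX" name;
 * namespace Y > 0 becomes "pmemX.Y".
 */
int pmem_disk_name(char *buf, size_t len, int region_id, int ns_id)
{
	if (ns_id == 0)
		return snprintf(buf, len, "pmem%d", region_id);
	return snprintf(buf, len, "pmem%d.%d", region_id, ns_id);
}
```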
     
  • The free dpa (dimm-physical-address) space calculation reports how much
    free space is available with consideration for aliased BLK + PMEM
    regions. Recall that BLK capacity is allocated from high addresses and
    PMEM is allocated from low addresses in their respective regions.

    nd_region_available_dpa() accounts for the fact that the largest
    encroachment (lowest starting address) into PMEM capacity by a BLK
    allocation limits the available capacity to that point, regardless of
    whether there is a BLK allocation hole at a higher address. Similarly, for the
    multi-pmem case we need to track the largest encroachment (highest
    ending address) of a PMEM allocation in BLK capacity regardless of
    whether there is an allocation hole that a BLK allocation could fill at
    a lower address.

    Signed-off-by: Dan Williams

    Dan Williams
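    The "largest encroachment" idea can be sketched with two scans. This is a hypothetical model using inclusive [start, end] ranges, not nd_region_available_dpa() itself: PMEM is limited by the lowest BLK start, and BLK by the highest PMEM end, regardless of holes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocation record: inclusive DPA range. */
struct alloc_model {
	uint64_t start;
	uint64_t end;
};

/* First DPA that PMEM (growing up) may not use. */
uint64_t pmem_limit(uint64_t region_end, const struct alloc_model *blk,
		    size_t nblk)
{
	uint64_t limit = region_end + 1;
	size_t i;

	for (i = 0; i < nblk; i++)
		if (blk[i].start < limit)
			limit = blk[i].start;
	return limit;
}

/* Lowest DPA that BLK (growing down) may use. */
uint64_t blk_floor(uint64_t region_start, const struct alloc_model *pmem,
		   size_t npmem)
{
	uint64_t lowest = region_start;
	size_t i;

	for (i = 0; i < npmem; i++)
		if (pmem[i].end + 1 > lowest)
			lowest = pmem[i].end + 1;
	return lowest;
}
```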
     
  • Add more determinism to initial namespace device-name assignments by
    sorting the namespaces by starting dpa.

    Signed-off-by: Dan Williams

    Dan Williams
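    Sorting by starting DPA before assigning names can be sketched with qsort(). The record type and function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

struct ns_sort_model {
	uint64_t dpa;	/* starting DPA of the namespace */
	int name_idx;	/* index used to build the device name */
};

static int cmp_dpa(const void *a, const void *b)
{
	const struct ns_sort_model *x = a, *y = b;

	/* Compare explicitly rather than subtracting, to avoid overflow. */
	return (x->dpa > y->dpa) - (x->dpa < y->dpa);
}

/* Sort by starting DPA, then hand out name indices in that order. */
void assign_names_by_dpa(struct ns_sort_model *ns, size_t n)
{
	size_t i;

	qsort(ns, n, sizeof(*ns), cmp_dpa);
	for (i = 0; i < n; i++)
		ns[i].name_idx = (int)i;
}
```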
     
  • If label scanning finds multiple valid pmem namespaces, allow them to be
    surfaced rather than failing namespace scanning. Support for creating
    multiple namespaces per region is saved for a later patch.

    Note that this adds some new error messages to clarify which of the pmem
    namespaces in the set are potentially impacted by invalid labels.

    Signed-off-by: Dan Williams

    Dan Williams
     

06 Oct, 2016

2 commits


01 Oct, 2016

5 commits

  • In preparation for enabling multiple namespaces per pmem region, convert
    the label tracking to use a linked list. In particular this will allow
    select_pmem_id() to move labels from the unvalidated state to the
    validated state. Currently we only track one validated set per-region.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Before we add more libnvdimm-private fields to nd_mapping make it clear
    which parameters are input vs libnvdimm internals. Use struct
    nd_mapping_desc instead of struct nd_mapping in nd_region_desc and make
    struct nd_mapping private to libnvdimm.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • The existing implementation writes to all the flush hint addresses for
    a given ND region. This is not necessary, as the flushes are per iMC
    (integrated memory controller) and not per DIMM. Search the mappings and
    clear out the duplicates at init to avoid multiple flushes to the same iMC.

    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dave Jiang
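    The de-duplication step can be sketched as a pass that zeroes any flush hint address already seen earlier in the list, so each distinct address (one per iMC) is written only once. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of distinct, non-zero hint addresses kept. */
size_t dedup_flush_hints(uint64_t *hints, size_t n)
{
	size_t i, j, unique = 0;

	for (i = 0; i < n; i++) {
		if (!hints[i])
			continue;
		for (j = 0; j < i; j++)
			if (hints[j] == hints[i])
				break;
		if (j < i)
			hints[i] = 0;	/* duplicate of an earlier entry */
		else
			unique++;
	}
	return unique;
}
```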
     
  • nvdimm_clear_poison cleared the user-visible badblocks, and sent
    commands to the NVDIMM to clear the areas marked as 'poison', but it
    neglected to clear the same areas from the internal poison_list which is
    used to marshal ARS results before sorting them by namespace. As a
    result, once on-demand ARS functionality was added:

    37b137f nfit, libnvdimm: allow an ARS scrub to be triggered on demand

    A scrub triggered from either sysfs or an MCE was found to be adding
    stale entries that had been cleared from gendisk->badblocks, but were
    still present in nvdimm_bus->poison_list. Additionally, simply disabling
    and re-enabling the namespace or region was enough for the stale entries
    to reappear in disk->badblocks.

    This adds the missing step of clearing poison_list entries when clearing
    poison, so that it is always in sync with badblocks.

    Fixes: 37b137f ("nfit, libnvdimm: allow an ARS scrub to be triggered on demand")
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
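    The missing step is removing a cleared range from the poison list, which may delete, trim, or split entries. The fixed-size array and names below are a hypothetical simplification, not the nvdimm_bus poison_list implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_POISON 16

/* Hypothetical poison list: each entry covers [start, start + len). */
struct poison_list {
	struct { uint64_t start, len; } e[MAX_POISON];
	size_t n;
};

/* Returns -1 if a split would overflow the fixed-size list. */
int forget_poison(struct poison_list *pl, uint64_t start, uint64_t len)
{
	uint64_t clr_end = start + len;
	size_t i = 0;

	while (i < pl->n) {
		uint64_t s = pl->e[i].start, end = s + pl->e[i].len;

		if (clr_end <= s || start >= end) {
			i++;				/* no overlap */
		} else if (start <= s && clr_end >= end) {
			pl->e[i] = pl->e[--pl->n];	/* fully cleared */
		} else if (start <= s) {
			pl->e[i].start = clr_end;	/* trim the head */
			pl->e[i].len = end - clr_end;
			i++;
		} else if (clr_end >= end) {
			pl->e[i].len = start - s;	/* trim the tail */
			i++;
		} else {
			if (pl->n == MAX_POISON)
				return -1;
			pl->e[i].len = start - s;	/* split in two */
			pl->e[pl->n].start = clr_end;
			pl->e[pl->n].len = end - clr_end;
			pl->n++;
			i++;
		}
	}
	return 0;
}
```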
     
  • pmem_do_bvec() used to kmap_atomic() at the beginning and only unmap at
    the end. Things like nvdimm_clear_poison may want to do nvdimm subsystem
    bookkeeping operations that may involve taking locks or doing memory
    allocations, and we can't do that from the atomic context. Reduce the
    atomic context to just what needs it - the memcpy to/from pmem.

    Cc: Ross Zwisler
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

25 Sep, 2016

1 commit

  • The definition of the flush hint table as:

    void __iomem *flush_wpq[0][0];

    ...passed the unit test, but is broken as flush_wpq[0][1] and
    flush_wpq[1][0] refer to the same entry. Fix this to use a helper that
    calculates a slot in the table based on the geometry of flush hints in
    the region. This is important to get right since virtualization
    solutions use this mechanism to trigger hypervisor flushes to platform
    persistence.

    Reported-by: Dave Jiang
    Tested-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dan Williams
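    A slot helper of the kind described stores the table as a flat array with a fixed number of hint slots per DIMM, so every (dimm, hint) pair maps to a distinct index. The name and the power-of-two assumption for hints_per_dimm are hypothetical.

```c
#include <stddef.h>

/* Assumes hints_per_dimm is a power of two, so "hint" wraps via a mask. */
size_t flush_slot(size_t dimm, size_t hint, size_t hints_per_dimm)
{
	return dimm * hints_per_dimm + (hint & (hints_per_dimm - 1));
}
```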
     

22 Sep, 2016

3 commits


19 Sep, 2016

1 commit

  • nd_activate_region() iomaps any hint addresses required when activating
    a region. To prevent duplicate mappings it checks the PFN of the hint to
    be mapped against the PFNs of the already mapped hints. Unfortunately it
    doesn't convert the PFN back into a physical address before passing it
    to devm_nvdimm_ioremap(). Instead it applies PHYS_PFN a second time
    which ends about as well as you would imagine.

    Signed-off-by: Oliver O'Halloran
    Signed-off-by: Dan Williams

    Oliver O'Halloran
     

10 Sep, 2016

1 commit

  • Bad blocks can be injected via /sys/block/pmemN/badblocks. When legacy
    pmem is being used, or a pmem region was created with the memmap kernel
    parameter, the injected bad blocks are not cleared because
    nvdimm_clear_poison() fails for lack of an ndctl function pointer. In
    this case we should just report the range as handled and allow the bad
    blocks to be cleared rather than fail.

    Reviewed-by: Vishal Verma
    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dave Jiang
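    The change can be sketched as an early return when the bus has no control function. The descriptor and function names are hypothetical; a NULL "ctl" stands for legacy or memmap-created pmem.

```c
#include <stddef.h>

struct bus_desc_model {
	long (*ctl)(unsigned long long phys, unsigned long long len);
};

/*
 * With no firmware interface there is nothing to send, so report the
 * whole range as cleared and let the caller drop its badblocks entries.
 */
long clear_poison_model(const struct bus_desc_model *desc,
			unsigned long long phys, unsigned long long len)
{
	if (!desc->ctl)
		return (long)len;
	return desc->ctl(phys, len);
}
```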
     

02 Sep, 2016

2 commits

  • 'ndctl list --buses --dimms' does not list any NVDIMM-Ns since
    they are considered idle. ndctl checks whether any driver is
    attached to an nmem device, but nvdimm_probe() always fails in
    nvdimm_init_nsarea() since NVDIMM-Ns do not implement the optional
    ND_CMD_GET_CONFIG_DATA command.

    Change nvdimm_probe() to accept the case that the CONFIG_DATA
    command is not implemented for NVDIMM-Ns. The driver attaches
    without ndd, which leaves it a no-op for the device.

    Reported-by: Brian Boylston
    Signed-off-by: Toshi Kani
    Cc: Dan Williams
    Tested-by: Johannes Thumshirn
    Acked-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Geert Uytterhoeven
     

30 Aug, 2016

1 commit

  • Per ACPI 6.1 Section 9.20.3, NVDIMM devices (children of the ACPI0012
    NVDIMM Root device) can receive health event notifications.

    Given that these devices are precluded from registering a notification
    handler via acpi_driver.acpi_device_ops (they have no _HID), we use
    acpi_install_notify_handler() directly. The registered handler,
    acpi_nvdimm_notify(), triggers a poll(2) event on the nmemX/nfit/flags
    sysfs attribute when a health event notification is received.

    Cc: Rafael J. Wysocki
    Tested-by: Toshi Kani
    Reviewed-by: Vishal Verma
    Acked-by: Rafael J. Wysocki
    Reviewed-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
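    From userspace, such an event can be consumed with the usual sysfs pattern: read the attribute, then poll(2) for POLLPRI/POLLERR. The function name is hypothetical, and the full path (for example under /sys/bus/nd/devices/nmem0/nfit/flags) is an assumption.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 on an event, 0 on timeout, -1 on error. */
int wait_for_health_event(const char *path, int timeout_ms)
{
	char buf[64];
	struct pollfd pfd;
	int fd, rc;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* Conventional sysfs pattern: read the current value, then poll. */
	if (read(fd, buf, sizeof(buf)) < 0) {
		close(fd);
		return -1;
	}
	pfd.fd = fd;
	pfd.events = POLLPRI;
	pfd.revents = 0;
	rc = poll(&pfd, 1, timeout_ms);
	close(fd);
	if (rc < 0)
		return -1;
	return (rc > 0 && (pfd.revents & (POLLPRI | POLLERR))) ? 1 : 0;
}
```

    After an event, the attribute should be re-read (re-open, or seek to 0 and read) to pick up the new flags.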
     

09 Aug, 2016

1 commit


08 Aug, 2016

2 commits

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Commit abf545484d31 changed it from an 'rw' flags type to the
    newer ops based interface, but now we're effectively leaking
    some bdev internals to the rest of the kernel. Since we only
    care about whether it's a read or a write at that level, just
    pass in a bool 'is_write' parameter instead.

    Then we can also move op_is_write() and friends back under
    CONFIG_BLOCK protection.

    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

1 commit

  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie