24 Mar, 2019

4 commits

  • commit 07464e88365e9236febaca9ed1a2e2006d8bc952 upstream.

    Libnvdimm reserves the first 8K of pfn and devicedax namespaces to
    store a superblock describing the namespace. This 8K reservation
    is contained within the altmap area which the kernel uses for the
    vmemmap backing for the pages within the namespace. The altmap
    allows for some pages at the start of the altmap area to be reserved
    and that mechanism is used to protect the superblock from being
    re-used as vmemmap backing.

    The number of PFNs to reserve is calculated using:

    PHYS_PFN(SZ_8K)

    Which is implemented as:

    #define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

    So on systems where PAGE_SIZE is greater than 8K the reservation
    size is truncated to zero and the superblock area is re-used as
    vmemmap backing. As a result all the namespace information stored
    in the superblock (i.e. if it's a PFN or DAX namespace) is lost
    and the namespace needs to be re-created to get access to the
    contents.

    This patch fixes this by using PFN_UP() rather than PHYS_PFN() to ensure
    that at least one page is reserved. On systems with a 4K page size this
    patch should have no effect.
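
    For comparison, PFN_UP() rounds up rather than truncating, so even an 8K
    reservation on a 64K-page system still claims one full page. Paraphrased
    from include/linux/pfn.h:

    #define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

    With a 64K PAGE_SIZE, PHYS_PFN(SZ_8K) evaluates to 0 while PFN_UP(SZ_8K)
    evaluates to 1, which is exactly the one-page reservation the superblock
    needs.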

    Cc: stable@vger.kernel.org
    Cc: Dan Williams
    Fixes: ac515c084be9 ("libnvdimm, pmem, pfn: move pfn setup to the core")
    Signed-off-by: Oliver O'Halloran
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Oliver O'Halloran
     
  • commit fa7d2e639cd90442d868dfc6ca1d4cc9d8bf206e upstream.

    For recovery, where non-dax access is needed to a given physical address
    range, and testing, allow the 'force_raw' attribute to override the
    default establishment of a dev_pagemap.

    Otherwise, without this capability, it is possible to end up with a
    namespace that cannot be activated due to a corrupted info-block, and one
    that cannot be repaired due to a section collision.

    Cc:
    Fixes: 004f1afbe199 ("libnvdimm, pmem: direct map legacy pmem by default")
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit f101ada7da6551127d192c2f1742c1e9e0f62799 upstream.

    When checking whether the current nd_region intersects with others,
    trim_pfn_device() has already calculated *size* expanded up to the
    SECTION size.

    Do not double append 'adjust' to 'size' when calculating whether the end
    of a region collides with the next pmem region.

    Fixes: ae86cbfef381 ("libnvdimm, pfn: Pad pfn namespaces relative to other regions")
    Cc:
    Signed-off-by: Wei Yang
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     
  • commit 966d23a006ca7b44ac8cf4d0c96b19785e0c3da0 upstream.

    The UEFI 2.7 specification sets expectations that the 'updating' flag is
    eventually cleared. To date, the libnvdimm core has never adhered to
    that protocol. The policy of the core matches the policy of other
    multi-device info-block formats like MD-Software-RAID that expect
    administrator intervention on inconsistent info-blocks, not automatic
    invalidation.

    However, some pre-boot environments may unfortunately attempt to "clean
    up" the labels and invalidate a set when it fails to find at least one
    "non-updating" label in the set. Clear the updating flag after set
    updates to minimize the window of vulnerability to aggressive pre-boot
    environments.

    Ideally implementations would not write to the label area outside of
    creating namespaces.

    Note that this only minimizes the window, it does not close it as the
    system can still crash while clearing the flag and the set can be
    subsequently deleted / invalidated by the pre-boot environment.

    Fixes: f524bf271a5c ("libnvdimm: write pmem label set")
    Cc:
    Cc: Kelly Couch
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

13 Jan, 2019

1 commit

  • commit a95c90f1e2c253b280385ecf3d4ebfe476926b28 upstream.

    The last step before devm_memremap_pages() returns success is to allocate
    a release action, devm_memremap_pages_release(), to tear the entire setup
    down. However, the result from devm_add_action() is not checked.

    Checking the error from devm_add_action() is not enough. The api
    currently relies on the fact that the percpu_ref it is using is killed by
    the time the devm_memremap_pages_release() is run. Rather than continue
    this awkward situation, offload the responsibility of killing the
    percpu_ref to devm_memremap_pages_release() directly. This allows
    devm_memremap_pages() to do the right thing relative to init failures and
    shutdown.

    Without this change we could fail to register the teardown of
    devm_memremap_pages(). The likelihood of hitting this failure is tiny as
    small memory allocations almost always succeed. However, the impact of
    the failure is large given any future reconfiguration, or disable/enable,
    of an nvdimm namespace will fail forever as subsequent calls to
    devm_memremap_pages() will fail to setup the pgmap_radix since there will
    be stale entries for the physical address range.

    An argument could be made to require that the ->kill() operation be set in
    the @pgmap arg rather than passed in separately. However, it helps code
    readability, tracking the lifetime of a given instance, to be able to grep
    the kill routine directly at the devm_memremap_pages() call site.

    Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
    Reviewed-by: "Jérôme Glisse"
    Reported-by: Logan Gunthorpe
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Christoph Hellwig
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

13 Dec, 2018

1 commit

  • commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

    Commit cfe30b872058 "libnvdimm, pmem: adjust for section collisions with
    'System RAM'" enabled Linux to workaround occasions where platform
    firmware arranges for "System RAM" and "Persistent Memory" to collide
    within a single section boundary. Unfortunately, as reported in this
    issue [1], platform firmware can inflict the same collision between
    persistent memory regions.

    The approach of interrogating iomem_resource does not work in this
    case because platform firmware may merge multiple regions into a single
    iomem_resource range. Instead provide a method to interrogate regions
    that share the same parent bus.

    This is a stop-gap until the core-MM can grow support for hotplug on
    sub-section boundaries.

    [1]: https://github.com/pmem/ndctl/issues/76

    Fixes: cfe30b872058 ("libnvdimm, pmem: adjust for section collisions with...")
    Cc:
    Reported-by: Patrick Geary
    Tested-by: Patrick Geary
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

14 Nov, 2018

3 commits

  • commit 91ed7ac444ef749603a95629a5ec483988c4f14b upstream.

    The driver is only initializing bb_res in the devm_memremap_pages()
    paths, but the raw namespace case is passing an uninitialized bb_res to
    nvdimm_badblocks_populate().

    Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
    Cc:
    Cc: Christoph Hellwig
    Reported-by: Jacek Zloch
    Reported-by: Krzysztof Rusocki
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 5d394eee2c102453278d81d9a7cf94c80253486a upstream.

    While experimenting with region driver loading, the following backtrace
    was triggered:

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    [..]
    Call Trace:
    dump_stack+0x85/0xcb
    register_lock_class+0x571/0x580
    ? __lock_acquire+0x2ba/0x1310
    ? kernfs_seq_start+0x2a/0x80
    __lock_acquire+0xd4/0x1310
    ? dev_attr_show+0x1c/0x50
    ? __lock_acquire+0x2ba/0x1310
    ? kernfs_seq_start+0x2a/0x80
    ? lock_acquire+0x9e/0x1a0
    lock_acquire+0x9e/0x1a0
    ? dev_attr_show+0x1c/0x50
    badblocks_show+0x70/0x190
    ? dev_attr_show+0x1c/0x50
    dev_attr_show+0x1c/0x50

    This results from a missing successful call to devm_init_badblocks()
    from nd_region_probe(). Block attempts to show badblocks while the
    region is not enabled.
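
    A hedged sketch of the guard described above; the helper and field names
    follow the region driver from memory and may not match the final patch
    exactly:

    static ssize_t region_badblocks_show(struct device *dev,
                    struct device_attribute *attr, char *buf)
    {
            struct nd_region *nd_region = to_nd_region(dev);
            ssize_t rc;

            device_lock(dev);
            if (dev->driver)        /* only valid once the region is enabled */
                    rc = badblocks_show(&nd_region->bb, buf, 0);
            else
                    rc = -ENXIO;
            device_unlock(dev);

            return rc;
    }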

    Fixes: 6a6bef90425e ("libnvdimm: add mechanism to publish badblocks...")
    Cc:
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit b6eae0f61db27748606cc00dafcfd1e2c032f0a5 upstream.

    Unlike asynchronous initialization in the core we have not yet associated
    the device with the parent, and as such the device doesn't hold a reference
    to the parent.

    In order to resolve that we should be holding a reference on the parent
    until the asynchronous initialization has completed.
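
    A minimal sketch of the pattern being described, assuming the usual
    get_device()/put_device() pairing around the async registration (names
    are illustrative):

    /* before scheduling the async probe, pin the parent */
    if (dev->parent)
            get_device(dev->parent);
    get_device(dev);
    async_schedule_domain(nd_async_device_register, dev, &nd_async_domain);

    /* ...and at the end of nd_async_device_register() */
    if (dev->parent)
            put_device(dev->parent);
    put_device(dev);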

    Cc:
    Fixes: 4d88a97aa9e8 ("libnvdimm: ...base ... infrastructure")
    Signed-off-by: Alexander Duyck
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Alexander Duyck
     

26 Aug, 2018

2 commits

  • …/linux/kernel/git/nvdimm/nvdimm

    Pull libnvdimm memory-failure update from Dave Jiang:
    "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
    backed mappings. The recovery code has specific enabling for several
    possible page states and needs new enabling to handle poison in dax
    mappings.

    In order to support reliable reverse mapping of user space addresses:

    1/ Add new locking in the memory_failure() rmap path to prevent races
    that would typically be handled by the page lock.

    2/ Since dev_pagemap pages are hidden from the page allocator and the
    "compound page" accounting machinery, add a mechanism to determine
    the size of the mapping that encompasses a given poisoned pfn.

    3/ Given pmem errors can be repaired, change the speculatively
    accessed poison protection, mce_unmap_kpfn(), to be reversible and
    otherwise allow ongoing access from the kernel.

    A side effect of this enabling is that MADV_HWPOISON becomes usable
    for dax mappings, however the primary motivation is to allow the
    system to survive userspace consumption of hardware-poison via dax.
    Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

    ...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

    Given all the cross dependencies I propose taking this through
    nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
    folks"

    * tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm, pmem: Restore page attributes when clearing errors
    x86/memory_failure: Introduce {set, clear}_mce_nospec()
    x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
    mm, memory_failure: Teach memory_failure() about dev_pagemap pages
    filesystem-dax: Introduce dax_lock_mapping_entry()
    mm, memory_failure: Collect mapping size in collect_procs()
    mm, madvise_inject_error: Let memory_failure() optionally take a page reference
    mm, dev_pagemap: Do not clear ->mapping on final put
    mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
    filesystem-dax: Set page->index
    device-dax: Set page->index
    device-dax: Enable page_mapping()
    device-dax: Convert to vmf_insert_mixed and vm_fault_t

    Linus Torvalds
     
  • Pull libnvdimm updates from Dave Jiang:
    "Collection of misc libnvdimm patches for 4.19 submission:

    - Add support to read locked nvdimm capacity.

    - Change test code to make DSM failure code injection an override.

    - Add support for calculating the maximum contiguous area for a namespace.

    - Add support for queueing a short ARS when there is an ongoing ARS for
    an nvdimm.

    - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
    params.

    - Improve smart injection support for nvdimm emulation testing.

    - Fix test code that supports emulating controller temperature.

    - Fix a hang on error before devm_memremap_pages().

    - Fix a bug that causes user memory corruption when data is returned to
    the user for ars_status.

    - Maintainer updates for Ross Zwisler's email address and adding Jan Kara
    for fsdax"

    * tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: fix ars_status output length calculation
    device-dax: avoid hang on error before devm_memremap_pages()
    tools/testing/nvdimm: improve emulation of smart injection
    filesystem-dax: Do not request kaddr and pfn when not required
    md/dm-writecache: Don't request pointer dummy_addr when not required
    dax/super: Do not request a pointer kaddr when not required
    tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
    s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
    libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
    acpi/nfit: queue issuing of ars when an uc error notification comes in
    libnvdimm: Export max available extent
    libnvdimm: Use max contiguous area for namespace size
    MAINTAINERS: Add Jan Kara for filesystem DAX
    MAINTAINERS: update Ross Zwisler's email address
    tools/testing/nvdimm: Fix support for emulating controller temperature
    tools/testing/nvdimm: Make DSM failure code injection an override
    acpi, nfit: Prefer _DSM over _LSR for namespace label reads
    libnvdimm: Introduce locked DIMM capacity support

    Linus Torvalds
     

21 Aug, 2018

1 commit

  • Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
    of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
    has also been marked UC in PAT (page attribute table) via
    set_mce_nospec() to prevent speculative retrievals of poison.

    The 'HWPoison' flag is only cleared when overwriting an entire page.

    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     

20 Aug, 2018

1 commit

    Commit efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
    introduced additional hardening for ambiguity in the ACPI spec around
    ars_status output sizing. However, it had a couple of cases mixed up:
    where it should have been checking for (and returning) "out_field[1] -
    4" it was using "out_field[1] - 8" and vice versa.

    This caused a four byte discrepancy in the buffer size passed on to
    the command handler, and in some cases, this caused memory corruption
    like:

    ./daxdev-errors.sh: line 76: 24104 Aborted (core dumped) ./daxdev-errors $busdev $region
    malloc(): memory corruption
    Program received signal SIGABRT, Aborted.
    [...]
    #5 0x00007ffff7865a2e in calloc () from /lib64/libc.so.6
    #6 0x00007ffff7bc2970 in ndctl_bus_cmd_new_ars_status (ars_cap=ars_cap@entry=0x6153b0) at ars.c:136
    #7 0x0000000000401644 in check_ars_status (check=0x7fffffffdeb0, bus=0x604c20) at daxdev-errors.c:144
    #8 test_daxdev_clear_error (region_name=, bus_name=)
    at daxdev-errors.c:332

    Cc:
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Lukasz Dorau
    Cc: Dan Williams
    Fixes: efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
    Signed-off-by: Vishal Verma
    Reviewed-by: Keith Busch
    Signed-off-by: Dave Jiang

    Vishal Verma
     

06 Aug, 2018

1 commit


31 Jul, 2018

1 commit

    pmem_direct_access() needs to check the kaddr and pfn pointers for
    NULL before assigning through them. If either is NULL, it doesn't need
    to calculate that value.

    A NULL pointer means the caller has no need for kaddr or pfn, so this
    patch prepares for allowing callers to pass in NULL instead of having
    to pass in a pointer or local variable that they then just throw away.
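
    A short sketch of the guarded assignments this enables in the pmem driver
    (a simplified excerpt; field names are from memory and may differ):

    long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
                    long nr_pages, void **kaddr, pfn_t *pfn)
    {
            resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;

            if (kaddr)      /* caller passed NULL: no need to compute it */
                    *kaddr = pmem->virt_addr + offset;
            if (pfn)
                    *pfn = phys_to_pfn_t(pmem->phys_addr + offset,
                                    pmem->pfn_flags);
            return nr_pages;
    }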

    Signed-off-by: Huaisheng Ye
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dave Jiang

    Huaisheng Ye
     

26 Jul, 2018

2 commits

    The 'available_size' attribute, which shows the combined total of all
    unallocated space, isn't always useful for knowing how large a namespace
    a user may be able to allocate when the region is fragmented. This patch
    exports the largest extent of unallocated space that may be allocated
    to create a new namespace.

    Signed-off-by: Keith Busch
    Reviewed-by: Vishal Verma
    Signed-off-by: Dave Jiang

    Keith Busch
     
    This patch finds the max contiguous area to determine the largest
    pmem namespace size that can be created. If the requested size exceeds
    the largest available extent, an ENOSPC error is returned.

    This fixes the allocation underrun error and wrong error return code
    that have otherwise been observed as the following kernel warning:

    WARNING: CPU: PID: at drivers/nvdimm/namespace_devs.c:913 size_store

    Fixes: a1f3e4d6a0c3 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support")
    Cc:
    Signed-off-by: Keith Busch
    Reviewed-by: Vishal Verma
    Signed-off-by: Dave Jiang

    Keith Busch
     

18 Jul, 2018

2 commits

  • Add and use a new op_stat_group() function for indexing partition stat
    fields rather than indexing them by rq_data_dir() or bio_data_dir().
    This function works similarly to op_is_sync() in that it takes the
    request::cmd_flags or bio::bi_opf flags and determines which stats
    should be updated.

    In addition, the second parameter to generic_start_io_acct() and
    generic_end_io_acct() is now a REQ_OP rather than simply a read or
    write bit and it uses op_stat_group() on the parameter to determine
    the stat group.

    Note that the partition in_flight counts are not part of the per-cpu
    statistics and as such are not indexed via this function. They are now
    indexed by op_is_write().

    tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.
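
    A hedged sketch of what such a grouping helper looks like and how an
    accounting site might use it; the exact definitions here are assumptions
    rather than quotes from the patch:

    static inline int op_stat_group(unsigned int op)
    {
            /* 0 selects the read stats, 1 the write stats */
            return op_is_write(op);
    }

    /* e.g. at bio accounting time */
    part_stat_inc(cpu, part, ios[op_stat_group(bio->bi_opf)]);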

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Minchan Kim
    Cc: Dan Williams
    Cc: Joshua Morris
    Cc: Philipp Reisner
    Cc: Matias Bjorling
    Cc: Kent Overstreet
    Cc: Alasdair Kergon
    Signed-off-by: Jens Axboe

    Michael Callahan
     
  • c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for
    read/write") replaced @op with boolean @is_write, which limited the
    amount of information going into ->rw_page() and more importantly
    page_endio(), which removed the need to expose block internals to mm.

    Unfortunately, we want to track discards separately and @is_write
    isn't enough information. This patch updates bdev_ops->rw_page() to
    take REQ_OP instead but leaves page_endio() to take bool @is_write.
    This allows the block part of operations to have enough information
    while not leaking it to mm.
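
    A brief sketch of the reworked hook as described here; the prototype is
    an assumption based on the commit text, not a verbatim quote:

    struct block_device_operations {
            /* was: bool is_write; now the full REQ_OP so drivers and the
             * block core can distinguish reads, writes and discards */
            int (*rw_page)(struct block_device *bdev, sector_t sector,
                            struct page *page, unsigned int op);
            /* ... remaining operations unchanged ... */
    };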

    Signed-off-by: Tejun Heo
    Cc: Mike Christie
    Cc: Minchan Kim
    Cc: Dan Williams
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jul, 2018

1 commit


14 Jul, 2018

1 commit

  • Pull libnvdimm fixes from Dave Jiang:

    - ensure that a variable passed in by reference to acpi_nfit_ctl is
    always set to a value. An incremental patch is provided due to a notice
    from testing in -next. The rest of the commits did not exhibit
    issues.

    - fix a return path in nsio_rw_bytes() that was not returning "bytes
    remain" as expected for the function.

    - address an issue where applications polling on scrub-completion for
    the NVDIMM may falsely wake up, read the wrong state value, and
    hang.

    - change the test unit persistent capability attribute to fix up a
    broken assumption in the unit test infrastructure wrt the
    'write_cache' attribute

    - ratelimit dev_info() in the dax device check_vma() function since
    this is easily triggered from userspace

    * tag 'libnvdimm-fixes-4.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    nfit: fix unchecked dereference in acpi_nfit_ctl
    acpi, nfit: Fix scrub idle detection
    tools/testing/nvdimm: advertise a write cache for nfit_test
    acpi/nfit: fix cmd_rc for acpi_nfit_ctl to always return a value
    dev-dax: check_vma: ratelimit dev_info-s
    libnvdimm, pmem: Fix memcpy_mcsafe() return code handling in nsio_rw_bytes()

    Linus Torvalds
     

29 Jun, 2018

2 commits

  • Commit 60622d68227d "x86/asm/memcpy_mcsafe: Return bytes remaining"
    converted callers of memcpy_mcsafe() to expect a positive 'bytes
    remaining' value rather than a negative error code. The nsio_rw_bytes()
    conversion failed to return success. The failure is benign in that
    nsio_rw_bytes() will end up writing back what it just read.

    Fixes: 60622d68227d ("x86/asm/memcpy_mcsafe: Return bytes remaining")
    Cc: Dan Williams
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dan Williams
     
  • QUEUE_FLAG_DAX is an indication that a given block device supports
    filesystem DAX and should not be set for PMEM namespaces which are in "raw"
    mode. These namespaces lack struct page and are prevented from
    participating in filesystem DAX as of commit 569d0365f571 ("dax: require
    'struct page' by default for filesystem dax").

    Signed-off-by: Ross Zwisler
    Suggested-by: Mike Snitzer
    Fixes: 569d0365f571 ("dax: require 'struct page' by default for filesystem dax")
    Cc: stable@vger.kernel.org
    Acked-by: Dan Williams
    Reviewed-by: Toshi Kani
    Signed-off-by: Mike Snitzer

    Ross Zwisler
     

09 Jun, 2018

2 commits


07 Jun, 2018

3 commits

  • This commit:

    5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")

    intended to make sure that deep flush was always available even on
    platforms which support a power-fail protected CPU cache. An unintended
    side effect of this change was that we also lost the ability to skip
    flushing CPU caches on platforms with a power-fail protected CPU cache.

    Fix this by skipping the low level cache flushing in dax_flush() if we have
    CPU caches which are power-fail protected. The user can still override this
    behavior by manually setting the write_cache state of a namespace. See
    libndctl's ndctl_namespace_write_cache_is_enabled(),
    ndctl_namespace_enable_write_cache() and
    ndctl_namespace_disable_write_cache() functions.
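
    A hedged example of that manual override via libndctl; the function names
    come from the text above, while the exact signatures and return values
    are assumptions:

    #include <ndctl/libndctl.h>

    /* force *sync to flush CPU caches even on a power-fail protected box */
    static void force_write_cache(struct ndctl_namespace *ndns)
    {
            if (ndctl_namespace_write_cache_is_enabled(ndns) != 1)
                    ndctl_namespace_enable_write_cache(ndns);
    }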

    Cc:
    Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Ross Zwisler
     
  • Prior to this commit we would only do a "deep flush" (have nvdimm_flush()
    write to each of the flush hints for a region) in response to an
    msync/fsync/sync call if the nvdimm_has_cache() returned true at the time
    we were setting up the request queue. This happens due to the write cache
    value passed in to blk_queue_write_cache(), which then causes the block
    layer to send down BIOs with REQ_FUA and REQ_PREFLUSH set. We do have a
    "write_cache" sysfs entry for namespaces, i.e.:

    /sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache

    which can be used to control whether or not the kernel thinks a given
    namespace has a write cache, but this didn't modify the deep flush behavior
    that we set up when the driver was initialized. Instead, it only modified
    whether or not DAX would flush CPU caches via dax_flush() in response to
    *sync calls.

    Simplify this by making the *sync deep flush always happen, regardless of
    the write cache setting of a namespace. The DAX CPU cache flushing will
    still be controlled by the write_cache setting of the namespace.

    Cc:
    Fixes: 5fdf8e5ba566 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Ross Zwisler
     
  • Complete the move from REQ_FLUSH to REQ_PREFLUSH that apparently started
    way back in v4.8.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Ross Zwisler
     

03 Jun, 2018

2 commits

  • There is currently a mismatch between the resources that will trigger
    the e820_pmem driver to register/load and the resources that will
    actually be surfaced as pmem ranges. register_e820_pmem() uses
    walk_iomem_res_desc() which includes children and siblings. In contrast,
    e820_pmem_probe() only considers top level resources. For example the
    following resource tree results in the driver being loaded, but no
    resources being registered:

    398000000000-39bfffffffff : PCI Bus 0000:ae
    39be00000000-39bf07ffffff : PCI Bus 0000:af
    39be00000000-39beffffffff : 0000:af:00.0
    39be10000000-39beffffffff : Persistent Memory (legacy)

    Fix this up to allow definitions of "legacy" pmem ranges anywhere in
    system-physical address space. Not that it is recommended or safe to
    define a pmem range in PCI space, but it is useful for debug /
    experimentation, and the restriction on being a top-level resource was
    arbitrary.

    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Instrument nvdimm_bus_probe() to emit timestamps for the start and end
    of libnvdimm device probing. This is useful for identifying sources of
    libnvdimm sub-system initialization latency.

    Signed-off-by: Dan Williams

    Dan Williams
     

01 Jun, 2018

1 commit

  • The pmem driver does not honor a forced read-only setting for very long:
    $ blockdev --setro /dev/pmem0
    $ blockdev --getro /dev/pmem0
    1

    followed by various commands like these:
    $ blockdev --rereadpt /dev/pmem0
    or
    $ mkfs.ext4 /dev/pmem0

    results in this in the kernel serial log:
    nd_pmem namespace0.0: region0 read-write, marking pmem0 read-write

    with the read-only setting lost:
    $ blockdev --getro /dev/pmem0
    0

    That's from bus.c nvdimm_revalidate_disk(), which always applies the
    setting from nd_region (which is initially based on the ACPI NFIT
    NVDIMM state flags not_armed bit).

    In contrast, commit 20bd1d026aac ("scsi: sd: Keep disk read-only when
    re-reading partition") fixed this issue for SCSI devices to preserve
    the previous setting if it was set to read-only.

    This patch modifies bus.c to preserve any previous read-only setting.
    It also eliminates the kernel serial log print except for cases where
    read-write is changed to read-only, so it doesn't print read-only to
    read-only non-changes.

    Cc:
    Fixes: 581388209405 ("libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only")
    Signed-off-by: Robert Elliott
    Signed-off-by: Dan Williams

    Robert Elliott
     

23 May, 2018

2 commits

  • Use the machine check safe version of copy_to_iter() for the
    ->copy_to_iter() operation published by the pmem driver.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Similar to the ->copy_from_iter() operation, a platform may want to
    deploy an architecture or device specific routine for handling reads
    from a dax_device like /dev/pmemX. On x86 this routine will point to a
    machine check safe version of copy_to_iter(). For now, add the plumbing
    to device-mapper and the dax core.
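
    A sketch of where the new hook lands in struct dax_operations, alongside
    the existing write-side operation (member layout assumed from the dax
    core of this era):

    struct dax_operations {
            long (*direct_access)(struct dax_device *, pgoff_t, long,
                            void **, pfn_t *);
            size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *,
                            size_t, struct iov_iter *);
            /* new: reads from pmem can go through an MCE-safe copy */
            size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *,
                            size_t, struct iov_iter *);
    };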

    Cc: Ross Zwisler
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

22 May, 2018

1 commit

  • In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
    be able to rely on the fact that they will get wakeups on dev_pagemap
    page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
    generic_dax_page_free() as common indicator / infrastructure for dax
    filesystems to require. With this change there are no users of the
    MEMORY_DEVICE_HOST designation, so remove it.

    The HMM sub-system extended dev_pagemap to arrange a callback when a
    dev_pagemap managed page is freed. Since a dev_pagemap page is free /
    idle when its reference count is 1, it requires an additional branch to
    check the page-type at put_page() time. Given put_page() is a hot-path
    we do not want to incur that check if HMM is not in use, so a static
    branch is used to avoid that overhead when not necessary.

    Now, the FS_DAX implementation wants to reuse this mechanism for
    receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
    static-key into a generic mechanism that either HMM or FS_DAX code paths
    can enable.

    For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
    care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
    However, we still need to support FS_DAX in the FS_DAX_LIMITED case
    implemented by the s390/dcssblk driver.
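
    A hedged sketch of the static-branch pattern being described: put_page()
    only pays for the dev_pagemap page-type check once an FS_DAX or HMM user
    has flipped the key (identifiers are illustrative, not verbatim):

    DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

    static inline bool put_devmap_managed_page(struct page *page)
    {
            if (!static_branch_unlikely(&devmap_managed_key))
                    return false;           /* common case: no extra work */
            if (!is_zone_device_page(page))
                    return false;
            switch (page->pgmap->type) {
            case MEMORY_DEVICE_FS_DAX:
                    /* fires ->page_free() when the count drops to idle */
                    __put_devmap_managed_page(page);
                    return true;
            default:
                    return false;
            }
    }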

    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michal Hocko
    Reported-by: kbuild test robot
    Reported-by: Thomas Meyer
    Reported-by: Dave Jiang
    Cc: "Jérôme Glisse"
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

15 May, 2018

1 commit

  • Machine check safe memory copies are currently deployed in the pmem
    driver whenever reading from persistent memory media, so that -EIO is
    returned rather than triggering a kernel panic. While this protects most
    pmem accesses, it is not complete in the filesystem-dax case. When
    filesystem-dax is enabled reads may bypass the block layer and the
    driver via dax_iomap_actor() and its usage of copy_to_iter().

    In preparation for creating a copy_to_iter() variant that can handle
    machine checks, teach memcpy_mcsafe() to return the number of bytes
    remaining rather than -EFAULT when an exception occurs.
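
    The new calling convention, as described above (a sketch of a caller,
    not a quote from the patch):

    unsigned long rem;

    rem = memcpy_mcsafe(dst, src, len);
    if (rem) {
            /* a machine check interrupted the copy; only 'len - rem' bytes
             * made it to 'dst', so report a short copy instead of -EFAULT */
            return len - rem;
    }
    return len;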

    Co-developed-by: Tony Luck
    Signed-off-by: Dan Williams
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: hch@lst.de
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-nvdimm@lists.01.org
    Link: http://lkml.kernel.org/r/152539238119.31796.14318473522414462886.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar

    Dan Williams
     

20 Apr, 2018

2 commits


16 Apr, 2018

1 commit

  • The new support for the standard _LSR and _LSW methods neglected to also
    update the nvdimm_init_config_data() and nvdimm_set_config_data() to
    return the translated error code from failed commands. This precision is
    necessary because the locked status that was previously returned on
    ND_CMD_GET_CONFIG_SIZE commands is now returned on
    ND_CMD_{GET,SET}_CONFIG_DATA commands.

    If the kernel misses this indication it can inadvertently fall back to
    label-less mode when it should otherwise avoid all access to locked
    regions.

    Cc:
    Fixes: 4b27db7e26cd ("acpi, nfit: add support for the _LSI, _LSR, and...")
    Signed-off-by: Dan Williams

    Dan Williams
     

11 Apr, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This cycle was was not something I ever want to repeat as there were
    several late changes that have only now just settled.

    Half of the branch up to commit d2c997c0f145 ("fs, dax: use
    page->mapping to warn...") have been in -next for several releases.
    The of_pmem driver and the address range scrub rework were late
    arrivals, and the dax work was scaled back at the last moment.

    The of_pmem driver missed a previous merge window due to an oversight.
    A sense of obligation to rectify that miss is why it is included for
    4.17. It has acks from PowerPC folks. Stephen reported a build failure
    that only occurs when merging it with your latest tree; for now I have
    fixed that up by disabling modular builds of of_pmem. A test merge
    with your tree has received a build success report from the 0day robot
    over 156 configs.

    An initial version of the ARS rework was submitted before the merge
    window. It is self-contained to libnvdimm, is a net code reduction, and
    passes all unit tests.

    The filesystem-dax changes are based on the wait_var_event()
    functionality from tip/sched/core. However, late review feedback
    showed that those changes regressed truncate performance to a large
    degree. The branch was rewound to drop the truncate behavior change
    and now only includes preparation patches and cleanups (with full acks
    and reviews). The finalization of this dax-dma-vs-truncate work will
    need to wait for 4.18.

    Summary:

    - A rework of the filesystem-dax implementation provides for detection
    of unmap operations (truncate / hole punch) colliding with
    in-progress device-DMA. A fix for these collisions remains a
    work-in-progress pending resolution of truncate latency and
    starvation regressions.

    - The of_pmem driver expands the users of libnvdimm outside of x86
    and ACPI to describe an implementation of persistent memory on
    PowerPC with Open Firmware / Device tree.

    - Address Range Scrub (ARS) handling is completely rewritten to
    account for the fact that ARS may run for 100s of seconds and there
    is no platform defined way to cancel it. ARS will now no longer
    block namespace initialization.

    - The NVDIMM Namespace Label implementation is updated to handle
    label areas as small as 1K, down from 128K.

    - Miscellaneous cleanups and updates to unit test infrastructure"

    * tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
    libnvdimm, of_pmem: workaround OF_NUMA=n build error
    nfit, address-range-scrub: add module option to skip initial ars
    nfit, address-range-scrub: rework and simplify ARS state machine
    nfit, address-range-scrub: determine one platform max_ars value
    powerpc/powernv: Create platform devs for nvdimm buses
    doc/devicetree: Persistent memory region bindings
    libnvdimm: Add device-tree based driver
    libnvdimm: Add of_node to region and bus descriptors
    libnvdimm, region: quiet region probe
    libnvdimm, namespace: use a safe lookup for dimm device name
    libnvdimm, dimm: fix dpa reservation vs uninitialized label area
    libnvdimm, testing: update the default smart ctrl_temperature
    libnvdimm, testing: Add emulation for smart injection commands
    nfit, address-range-scrub: introduce nfit_spa->ars_state
    libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
    nfit, address-range-scrub: fix scrub in-progress reporting
    dax, dm: allow device-mapper to operate without dax support
    dax: introduce CONFIG_DAX_DRIVER
    fs, dax: use page->mapping to warn if truncate collides with a busy page
    ext2, dax: introduce ext2_dax_aops
    ...

    Linus Torvalds
     

10 Apr, 2018

1 commit