13 Jan, 2019

1 commit

  • commit a95c90f1e2c253b280385ecf3d4ebfe476926b28 upstream.

    The last step before devm_memremap_pages() returns success is to allocate
    a release action, devm_memremap_pages_release(), to tear the entire setup
    down. However, the result from devm_add_action() is not checked.

    Checking the error from devm_add_action() is not enough. The API
    currently relies on the fact that the percpu_ref it is using is killed by
    the time devm_memremap_pages_release() is run. Rather than continue
    this awkward situation, offload the responsibility of killing the
    percpu_ref to devm_memremap_pages_release() directly. This allows
    devm_memremap_pages() to do the right thing relative to init failures and
    shutdown.

    Without this change we could fail to register the teardown of
    devm_memremap_pages(). The likelihood of hitting this failure is tiny as
    small memory allocations almost always succeed. However, the impact of
    the failure is large given any future reconfiguration, or disable/enable,
    of an nvdimm namespace will fail forever as subsequent calls to
    devm_memremap_pages() will fail to set up the pgmap_radix since there will
    be stale entries for the physical address range.

    An argument could be made to require that the ->kill() operation be set in
    the @pgmap arg rather than passed in separately. However, it helps code
    readability, tracking the lifetime of a given instance, to be able to grep
    the kill routine directly at the devm_memremap_pages() call site.
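
    For reference, a minimal sketch of the safer registration pattern, using
    the generic devm_add_action_or_reset() helper; my_release() and
    my_setup() are illustrative stand-ins rather than the actual patch:

    #include <linux/device.h>

    /* illustrative release action; the real one is
     * devm_memremap_pages_release() */
    static void my_release(void *data)
    {
            /* kill the percpu_ref and tear the mapping down */
    }

    static int my_setup(struct device *dev, void *data)
    {
            /*
             * devm_add_action_or_reset() runs the action immediately if
             * registration fails, so an allocation failure here can no
             * longer return success while leaving no teardown armed.
             */
            return devm_add_action_or_reset(dev, my_release, data);
    }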

    Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
    Reviewed-by: "Jérôme Glisse"
    Reported-by: Logan Gunthorpe
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Christoph Hellwig
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

23 Sep, 2018

1 commit

    With the .set_page_dirty address_space_operation missing for device dax,
    we hit a kernel warning when running the destructive ndctl unit
    test: make TESTS=device-dax check

    WARNING: CPU: 3 PID: 7380 at fs/buffer.c:581 __set_page_dirty+0xb1/0xc0

    Set the address_space_operations to noop_set_page_dirty and
    noop_invalidatepage for device dax to prevent fallback to
    __set_page_dirty_buffers() and block_invalidatepage(), respectively.

    Fixes: 2232c6382a ("device-dax: Enable page_mapping()")

    Acked-by: Jeff Moyer
    Reported-by: Vishal Verma
    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dave Jiang
     

05 Sep, 2018

1 commit

    As part of 226ab561075f ("device-dax: Convert to vmf_insert_mixed and
    vm_fault_t") in 4.19-rc1, 'rc' was not converted to vm_fault_t; convert
    it now.

    Link: http://lkml.kernel.org/r/20180830153813.GA26059@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

26 Aug, 2018

2 commits

  • …/linux/kernel/git/nvdimm/nvdimm

    Pull libnvdimm memory-failure update from Dave Jiang:
    "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
    backed mappings. The recovery code has specific enabling for several
    possible page states and needs new enabling to handle poison in dax
    mappings.

    In order to support reliable reverse mapping of user space addresses:

    1/ Add new locking in the memory_failure() rmap path to prevent races
    that would typically be handled by the page lock.

    2/ Since dev_pagemap pages are hidden from the page allocator and the
    "compound page" accounting machinery, add a mechanism to determine
    the size of the mapping that encompasses a given poisoned pfn.

    3/ Given pmem errors can be repaired, change the speculatively
    accessed poison protection, mce_unmap_kpfn(), to be reversible and
    otherwise allow ongoing access from the kernel.

    A side effect of this enabling is that MADV_HWPOISON becomes usable
    for dax mappings; however, the primary motivation is to allow the
    system to survive userspace consumption of hardware-poison via dax.
    Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

    ...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

    Given all the cross dependencies I propose taking this through
    nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
    folks"

    * tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm, pmem: Restore page attributes when clearing errors
    x86/memory_failure: Introduce {set, clear}_mce_nospec()
    x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
    mm, memory_failure: Teach memory_failure() about dev_pagemap pages
    filesystem-dax: Introduce dax_lock_mapping_entry()
    mm, memory_failure: Collect mapping size in collect_procs()
    mm, madvise_inject_error: Let memory_failure() optionally take a page reference
    mm, dev_pagemap: Do not clear ->mapping on final put
    mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
    filesystem-dax: Set page->index
    device-dax: Set page->index
    device-dax: Enable page_mapping()
    device-dax: Convert to vmf_insert_mixed and vm_fault_t

    Linus Torvalds
     
  • Pull libnvdimm updates from Dave Jiang:
    "Collection of misc libnvdimm patches for 4.19 submission:

    - Adding support to read locked nvdimm capacity.

    - Change test code to make DSM failure code injection an override.

    - Add support for calculating the maximum contiguous area for a
    namespace.

    - Add support for queueing a short ARS when there is an ongoing ARS for
    an nvdimm.

    - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
    params.

    - Improve smart injection support for nvdimm emulation testing.

    - Fix test code that supports emulating controller temperature.

    - Fix a hang on error before devm_memremap_pages().

    - Fix a bug that causes user memory corruption when data is returned to
    the user for ars_status.

    - Maintainer updates: Ross Zwisler's email address, and adding Jan Kara
    for fsdax"

    * tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: fix ars_status output length calculation
    device-dax: avoid hang on error before devm_memremap_pages()
    tools/testing/nvdimm: improve emulation of smart injection
    filesystem-dax: Do not request kaddr and pfn when not required
    md/dm-writecache: Don't request pointer dummy_addr when not required
    dax/super: Do not request a pointer kaddr when not required
    tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
    s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
    libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
    acpi/nfit: queue issuing of ars when an uc error notification comes in
    libnvdimm: Export max available extent
    libnvdimm: Use max contiguous area for namespace size
    MAINTAINERS: Add Jan Kara for filesystem DAX
    MAINTAINERS: update Ross Zwisler's email address
    tools/testing/nvdimm: Fix support for emulating controller temperature
    tools/testing/nvdimm: Make DSM failure code injection an override
    acpi, nfit: Prefer _DSM over _LSR for namespace label reads
    libnvdimm: Introduce locked DIMM capacity support

    Linus Torvalds
     

18 Aug, 2018

1 commit

    This patch is reworked from an earlier patch that Dan had posted:
    https://patchwork.kernel.org/patch/10131727/

    VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
    the memory page it is dealing with is not typical memory from the linear
    map. The get_user_pages_fast() path, since it does not resolve the vma,
    is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
    use that as a VM_MIXEDMAP replacement in some locations. In the cases
    where there is no pte to consult we fall back to using vma_is_dax() to
    detect the VM_MIXEDMAP special case.
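
    For reference, vma_is_dax() is defined upstream along these lines (a
    sketch, not part of this patch):

    #include <linux/fs.h>
    #include <linux/mm.h>

    /* a VMA maps dax memory if it is file-backed and the host inode
     * has the DAX state set; usable when there is no pte to consult */
    static inline bool vma_is_dax(struct vm_area_struct *vma)
    {
            return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
    }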

    Now that we have explicit driver pfn_t-flag opt-in/opt-out for
    get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
    also means we no longer need to worry about safely manipulating vm_flags
    in a future where we support dynamically changing the dax mode of a
    file.

    DAX should also now be supported with madvise_behavior(), vma_merge(),
    and copy_page_range().

    This patch has been tested against ndctl unit test. It has also been
    tested against xfstests commit: 625515d using fake pmem created by
    memmap and no additional issues have been observed.

    Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Acked-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

01 Aug, 2018

1 commit

  • dax_pmem_percpu_exit() waits for dax_pmem_percpu_release() to invoke the
    dax_pmem->cmp completion. Unfortunately this approach to cleaning up
    the percpu_ref only works after devm_memremap_pages() was successful.

    If devm_add_action_or_reset() or devm_memremap_pages() fails,
    dax_pmem_percpu_release() is not invoked. Therefore
    dax_pmem_percpu_exit() hangs waiting for the completion:

    rc = devm_add_action_or_reset(dev, dax_pmem_percpu_exit,
                                  &dax_pmem->ref);
    if (rc)
            return rc;

    dax_pmem->pgmap.ref = &dax_pmem->ref;
    addr = devm_memremap_pages(dev, &dax_pmem->pgmap);

    Avoid the hang by calling percpu_ref_exit() in the error paths instead
    of going through dax_pmem_percpu_exit().
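
    A hedged sketch of the resulting error handling, reusing the names from
    the snippet above (the exact unwind in the applied patch may differ):

    addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
    if (IS_ERR(addr)) {
            /* the release action never ran, so exit the percpu_ref
             * here instead of waiting on dax_pmem->cmp forever */
            percpu_ref_exit(&dax_pmem->ref);
            return PTR_ERR(addr);
    }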

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Dave Jiang

    Stefan Hajnoczi
     

21 Jul, 2018

3 commits

  • In support of enabling memory_failure() handling for device-dax
    mappings, set ->index to the pgoff of the page. The rmap implementation
    requires ->index to bound the search through the vma interval tree.

    The ->index value is never cleared. There is no possibility for the
    page to become associated with another pgoff while the device is
    enabled. When the device is disabled the 'struct page' array for the
    device is destroyed and ->index is reinitialized to zero.

    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     
  • In support of enabling memory_failure() handling for device-dax
    mappings, set the ->mapping association of pages backing device-dax
    mappings. The rmap implementation requires page_mapping() to return the
    address_space hosting the vmas that map the page.

    The ->mapping pointer is never cleared. There is no possibility for the
    page to become associated with another address_space while the device is
    enabled. When the device is disabled the 'struct page' array for the
    device is destroyed / later reinitialized to zero.
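
    A hedged sketch of the association step that this entry and the
    previous one describe, as it might appear in a device-dax fault
    handler; fault_size, filp, pfn, and vmf are assumed from the
    surrounding driver context:

    unsigned long i;
    pgoff_t pgoff = linear_page_index(vmf->vma, vmf->address);

    for (i = 0; i < fault_size / PAGE_SIZE; i++) {
            struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i);

            if (page->mapping)
                    continue;
            page->mapping = filp->f_mapping; /* for page_mapping() */
            page->index = pgoff + i;         /* bounds the rmap search */
    }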

    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     
    Use the new return type vm_fault_t for the fault and huge_fault
    handlers. For now, this just documents that the functions return a
    VM_FAULT value rather than an errno. Once all instances are converted,
    vm_fault_t will become a distinct type.

    Commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    Previously vm_insert_mixed() returned an error code which the driver
    mapped into a VM_FAULT_* type. The new function vmf_insert_mixed()
    removes this inefficiency by returning a VM_FAULT_* type directly.
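
    A hedged before/after sketch of a fault handler using the two calls
    (the errno-to-VM_FAULT mapping shown is illustrative):

    /* before: errno from vm_insert_mixed() mapped by hand */
    int rc = vm_insert_mixed(vma, vmf->address, pfn);
    if (rc == -ENOMEM)
            return VM_FAULT_OOM;
    if (rc < 0 && rc != -EBUSY)
            return VM_FAULT_SIGBUS;
    return VM_FAULT_NOPAGE;

    /* after: vmf_insert_mixed() returns a vm_fault_t directly */
    return vmf_insert_mixed(vma, vmf->address, pfn);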

    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     

14 Jul, 2018

1 commit

  • Pull libnvdimm fixes from Dave Jiang:

    - ensure that a variable passed in by reference to acpi_nfit_ctl is
    always set to a value. An incremental patch is provided due to an
    issue noticed during testing in -next. The rest of the commits did not
    exhibit issues.

    - fix a return path in nsio_rw_bytes() that was not returning "bytes
    remain" as expected for the function.

    - address an issue where applications polling on scrub-completion for
    the NVDIMM may falsely wake up, read the wrong state value, and
    hang.

    - change the test unit persistent capability attribute to fix up a
    broken assumption in the unit test infrastructure wrt the
    'write_cache' attribute

    - ratelimit dev_info() in the dax device check_vma() function since
    this is easily triggered from userspace

    * tag 'libnvdimm-fixes-4.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    nfit: fix unchecked dereference in acpi_nfit_ctl
    acpi, nfit: Fix scrub idle detection
    tools/testing/nvdimm: advertise a write cache for nfit_test
    acpi/nfit: fix cmd_rc for acpi_nfit_ctl to always return a value
    dev-dax: check_vma: ratelimit dev_info-s
    libnvdimm, pmem: Fix memcpy_mcsafe() return code handling in nsio_rw_bytes()

    Linus Torvalds
     

29 Jun, 2018

2 commits

  • This is easily triggered from userspace, so let's ratelimit the
    messages.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Dan Williams

    Jeff Moyer
     
  • Add an explicit check for QUEUE_FLAG_DAX to __bdev_dax_supported(). This
    is needed for DM configurations where the first element in the dm-linear or
    dm-stripe target supports DAX, but other elements do not. Without this
    check __bdev_dax_supported() will pass for such devices, letting a
    filesystem on that device mount with the DAX option.
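
    A hedged sketch of the added gate in __bdev_dax_supported(), assuming
    the QUEUE_FLAG_DAX test is made via the blk_queue_dax() helper:

    /* the whole device must advertise DAX, not just the first
     * dm-linear / dm-stripe member */
    if (!blk_queue_dax(bdev->bd_queue))
            return false;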

    Signed-off-by: Ross Zwisler
    Suggested-by: Mike Snitzer
    Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
    Cc: stable@vger.kernel.org
    Acked-by: Dan Williams
    Reviewed-by: Toshi Kani
    Signed-off-by: Mike Snitzer

    Ross Zwisler
     

09 Jun, 2018

3 commits

  • Pull libnvdimm updates from Dan Williams:
    "This adds a user for the new 'bytes-remaining' updates to
    memcpy_mcsafe() that you already received through Ingo via the
    x86-dax-for-linus pull.

    Not included here, but still targeting this cycle, is support for
    handling memory media errors (poison) consumed via userspace dax
    mappings.

    Summary:

    - DAX broke a fundamental assumption of truncate of file mapped
    pages. The truncate path assumed that it is safe to disconnect a
    pinned page from a file and let the filesystem reclaim the physical
    block. With DAX the page is equivalent to the filesystem block.
    Introduce dax_layout_busy_page() to enable filesystems to wait for
    pinned DAX pages to be released. Without this wait a filesystem
    could allocate blocks under active device-DMA to a new file.

    - DAX arranges for the block layer to be bypassed and uses
    dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
    However, the memcpy_mcsafe() facility is available through the pmem
    block driver. In order to safely handle media errors, via the DAX
    block-layer bypass, introduce copy_to_iter_mcsafe().

    - Fix cache management policy relative to the ACPI NFIT Platform
    Capabilities Structure to properly elide cache flushes when they
    are not necessary. The table indicates whether CPU caches are
    power-fail protected. Clarify that a deep flush is always performed
    on REQ_{FUA,PREFLUSH} requests"

    * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
    dax: Use dax_write_cache* helpers
    libnvdimm, pmem: Do not flush power-fail protected CPU caches
    libnvdimm, pmem: Unconditionally deep flush on *sync
    libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
    acpi, nfit: Remove ecc_unit_size
    dax: dax_insert_mapping_entry always succeeds
    libnvdimm, e820: Register all pmem resources
    libnvdimm: Debug probe times
    linvdimm, pmem: Preserve read-only setting for pmem devices
    x86, nfit_test: Add unit test for memcpy_mcsafe()
    pmem: Switch to copy_to_iter_mcsafe()
    dax: Report bytes remaining in dax_iomap_actor()
    dax: Introduce a ->copy_to_iter dax operation
    uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
    xfs, dax: introduce xfs_break_dax_layouts()
    xfs: prepare xfs_break_layouts() for another layout type
    xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
    mm, fs, dax: handle layout changes to pinned dax mappings
    mm: fix __gup_device_huge vs unmap
    mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
    ...

    Linus Torvalds
     

07 Jun, 2018

3 commits

  • Pull overflow updates from Kees Cook:
    "This adds the new overflow checking helpers and adds them to the
    2-factor argument allocators. And this adds the saturating size
    helpers and does a treewide replacement for the struct_size() usage.
    Additionally this adds the overflow testing modules to make sure
    everything works.

    I'm still working on the treewide replacements for allocators with
    "simple" multiplied arguments:

    *alloc(a * b, ...) -> *alloc_array(a, b, ...)

    and

    *zalloc(a * b, ...) -> *calloc(a, b, ...)

    as well as the more complex cases, but that's separable from this
    portion of the series. I expect to have the rest sent before -rc1
    closes; there are a lot of messy cases to clean up.

    Summary:

    - Introduce arithmetic overflow test helper functions (Rasmus)

    - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

    - Introduce overflow test module (Rasmus, Kees)

    - Introduce saturating size helper functions (Matthew, Kees)

    - Treewide use of struct_size() for allocators (Kees)"

    * tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    treewide: Use struct_size() for devm_kmalloc() and friends
    treewide: Use struct_size() for vmalloc()-family
    treewide: Use struct_size() for kmalloc()-family
    device: Use overflow helpers for devm_kmalloc()
    mm: Use overflow helpers in kvmalloc()
    mm: Use overflow helpers in kmalloc_array*()
    test_overflow: Add memory allocation overflow tests
    overflow.h: Add allocation size calculation helpers
    test_overflow: Report test failures
    test_overflow: macrofy some more, do more tests for free
    lib: add runtime test of check_*_overflow functions
    compiler.h: enable builtin overflow checkers and add fallback code

    Linus Torvalds
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct foo {
            int stuff;
            void *entry[];
    };

    instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

    This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
    uses. It was done via automatic conversion with manual review for the
    "CHECKME" non-standard cases noted below, using the following Coccinelle
    script:

    // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
    // sizeof *pkey_cache->table, GFP_KERNEL);
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    identifier VAR, ELEMENT;
    expression COUNT;
    @@

    - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
    + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

    // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    identifier VAR, ELEMENT;
    expression COUNT;
    @@

    - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
    + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

    // Same pattern, but can't trivially locate the trailing element name,
    // or variable name.
    @@
    identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
    expression GFP;
    expression SOMETHING, COUNT, ELEMENT;
    @@

    - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
    + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

    Signed-off-by: Kees Cook

    Kees Cook
     
  • Use dax_write_cache() and dax_write_cache_enabled() instead of open coding
    the bit operations.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Ross Zwisler
     

31 May, 2018

2 commits

    The function's return values are confusing given the way it is named.
    We expect a true or false return value, but it actually returns
    0/-errno, which makes the code very confusing. Change the function to
    return a bool: true if DAX is supported, false if not.

    Signed-off-by: Dave Jiang
    Signed-off-by: Ross Zwisler
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Jiang
     
  • Change bdev_dax_supported so it takes a bdev parameter. This enables
    multi-device filesystems like xfs to check that a dax device can work for
    the particular filesystem. Once that's in place, actually fix all the
    parts of XFS where we need to be able to distinguish between datadev and
    rtdev.

    This patch fixes the problem where we screw up the dax support checking
    in xfs if the datadev and rtdev have different dax capabilities.
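
    A hedged sketch of what the bdev parameter enables for a two-device
    filesystem; the variable names are illustrative, not the actual xfs
    change:

    /* validate each backing device on its own merits */
    bool data_dax = bdev_dax_supported(datadev_bdev, blocksize);
    bool rt_dax = rtdev_bdev &&
            bdev_dax_supported(rtdev_bdev, blocksize);

    if (!data_dax && !rt_dax)
            return -EINVAL; /* neither device can back DAX mappings */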

    Signed-off-by: Darrick J. Wong
    [rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
    Signed-off-by: Ross Zwisler
    Reviewed-by: Eric Sandeen

    Darrick J. Wong
     

23 May, 2018

1 commit

  • Similar to the ->copy_from_iter() operation, a platform may want to
    deploy an architecture or device specific routine for handling reads
    from a dax_device like /dev/pmemX. On x86 this routine will point to a
    machine check safe version of copy_to_iter(). For now, add the plumbing
    to device-mapper and the dax core.
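
    A hedged sketch of how struct dax_operations might look with the new
    hook added alongside the existing operations:

    struct dax_operations {
            /* contiguous pages at a device offset, plus kaddr/pfn */
            long (*direct_access)(struct dax_device *, pgoff_t, long,
                            void **, pfn_t *);
            /* write path, already present */
            size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *,
                            size_t, struct iov_iter *);
            /* new read path: may point at a machine-check-safe copy */
            size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *,
                            size_t, struct iov_iter *);
    };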

    Cc: Ross Zwisler
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

22 May, 2018

1 commit

  • In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
    be able to rely on the fact that they will get wakeups on dev_pagemap
    page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
    generic_dax_page_free() as common indicator / infrastructure for dax
    filesystems to require. With this change there are no users of the
    MEMORY_DEVICE_HOST designation, so remove it.

    The HMM sub-system extended dev_pagemap to arrange a callback when a
    dev_pagemap managed page is freed. Since a dev_pagemap page is free /
    idle when its reference count is 1 it requires an additional branch to
    check the page-type at put_page() time. Given put_page() is a hot-path
    we do not want to incur that check if HMM is not in use, so a static
    branch is used to avoid that overhead when not necessary.

    Now, the FS_DAX implementation wants to reuse this mechanism for
    receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
    static-key into a generic mechanism that either HMM or FS_DAX code paths
    can enable.
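
    A hedged sketch of the generic static-key arrangement (helper names
    approximate the upstream ones):

    DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

    static inline bool put_devmap_managed_page(struct page *page)
    {
            /* compiles to a no-op jump unless HMM or FS_DAX enabled it */
            if (!static_branch_unlikely(&devmap_managed_key))
                    return false;
            if (!is_zone_device_page(page))
                    return false;
            /* ...dispatch on page->pgmap->type, e.g. MEMORY_DEVICE_FS_DAX,
             * to run the ->page_free() callback when the count drops... */
            return true;
    }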

    For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
    care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
    However, we still need to support FS_DAX in the FS_DAX_LIMITED case
    implemented by the s390/dcssblk driver.

    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michal Hocko
    Reported-by: kbuild test robot
    Reported-by: Thomas Meyer
    Reported-by: Dave Jiang
    Cc: "Jérôme Glisse"
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

20 Apr, 2018

1 commit

  • MAP_SYNC is a nop for device-dax. Allow MAP_SYNC to succeed on device-dax
    to eliminate special casing between device-dax and fs-dax as to when the
    flag can be specified. Device-dax users already implicitly assume that they do
    not need to call fsync(), and this enables them to explicitly check for this
    capability.
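
    A hedged sketch of the opt-in; .mmap_supported_flags is the mechanism
    by which MAP_SHARED_VALIDATE-gated flags are advertised, though the
    exact fops layout here is illustrative:

    static const struct file_operations dax_fops = {
            .owner = THIS_MODULE,
            .mmap = dax_mmap,
            /* MAP_SYNC is a nop for device-dax, so advertise support
             * instead of failing mmap(MAP_SHARED_VALIDATE | MAP_SYNC) */
            .mmap_supported_flags = MAP_SYNC,
    };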

    Cc:
    Fixes: b6fb293f2497 ("mm: Define MAP_SYNC and VM_SYNC flags")
    Signed-off-by: Dave Jiang
    Reviewed-by: Dan Williams
    Signed-off-by: Dan Williams

    Dave Jiang
     

11 Apr, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This cycle was was not something I ever want to repeat as there were
    several late changes that have only now just settled.

    Half of the branch, up to commit d2c997c0f145 ("fs, dax: use
    page->mapping to warn..."), has been in -next for several releases.
    The of_pmem driver and the address range scrub rework were late
    arrivals, and the dax work was scaled back at the last moment.

    The of_pmem driver missed a previous merge window due to an oversight.
    A sense of obligation to rectify that miss is why it is included for
    4.17. It has acks from PowerPC folks. Stephen reported a build failure
    that only occurs when merging it with your latest tree; for now I have
    fixed that up by disabling modular builds of of_pmem. A test merge
    with your tree has received a build success report from the 0day robot
    over 156 configs.

    An initial version of the ARS rework was submitted before the merge
    window. It is self-contained to libnvdimm, a net code reduction, and
    passes all unit tests.

    The filesystem-dax changes are based on the wait_var_event()
    functionality from tip/sched/core. However, late review feedback
    showed that those changes regressed truncate performance to a large
    degree. The branch was rewound to drop the truncate behavior change
    and now only includes preparation patches and cleanups (with full acks
    and reviews). The finalization of this dax-dma-vs-truncate work will
    need to wait for 4.18.

    Summary:

    - A rework of the filesystem-dax implementation provides for detection
    of unmap operations (truncate / hole punch) colliding with
    in-progress device-DMA. A fix for these collisions remains a
    work-in-progress pending resolution of truncate latency and
    starvation regressions.

    - The of_pmem driver expands the users of libnvdimm outside of x86
    and ACPI to describe an implementation of persistent memory on
    PowerPC with Open Firmware / Device tree.

    - Address Range Scrub (ARS) handling is completely rewritten to
    account for the fact that ARS may run for 100s of seconds and there
    is no platform defined way to cancel it. ARS will now no longer
    block namespace initialization.

    - The NVDIMM Namespace Label implementation is updated to handle
    label areas as small as 1K, down from 128K.

    - Miscellaneous cleanups and updates to unit test infrastructure"

    * tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
    libnvdimm, of_pmem: workaround OF_NUMA=n build error
    nfit, address-range-scrub: add module option to skip initial ars
    nfit, address-range-scrub: rework and simplify ARS state machine
    nfit, address-range-scrub: determine one platform max_ars value
    powerpc/powernv: Create platform devs for nvdimm buses
    doc/devicetree: Persistent memory region bindings
    libnvdimm: Add device-tree based driver
    libnvdimm: Add of_node to region and bus descriptors
    libnvdimm, region: quiet region probe
    libnvdimm, namespace: use a safe lookup for dimm device name
    libnvdimm, dimm: fix dpa reservation vs uninitialized label area
    libnvdimm, testing: update the default smart ctrl_temperature
    libnvdimm, testing: Add emulation for smart injection commands
    nfit, address-range-scrub: introduce nfit_spa->ars_state
    libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
    nfit, address-range-scrub: fix scrub in-progress reporting
    dax, dm: allow device-mapper to operate without dax support
    dax: introduce CONFIG_DAX_DRIVER
    fs, dax: use page->mapping to warn if truncate collides with a busy page
    ext2, dax: introduce ext2_dax_aops
    ...

    Linus Torvalds
     

06 Apr, 2018

1 commit

  • Given that device-dax is making similar page mapping size guarantees as
    hugetlbfs, emit the size in smaps and any other kernel path that
    requests the mapping size of a vma.

    Link: http://lkml.kernel.org/r/151996255287.27922.18397777516059080245.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Jane Chu
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

03 Apr, 2018

1 commit

  • In support of allowing device-mapper to compile out idle/dead code when
    there are no dax providers in the system, introduce the DAX_DRIVER
    symbol. This is selected by all leaf drivers that device-mapper might be
    layered on top of. This allows device-mapper to conditionally 'select DAX'
    only when a provider is present.

    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Reported-by: Bart Van Assche
    Reviewed-by: Mike Snitzer
    Signed-off-by: Dan Williams

    Dan Williams
     

31 Mar, 2018

1 commit

  • In preparation for examining the busy state of dax pages in the truncate
    path, switch from sectors to pfns in the radix.

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     

07 Mar, 2018

1 commit

  • Dynamic debug can be instructed to add the function name to the debug
    output using the +f switch, so there is no need for the dax modules to
    do it again. If a user decides to add the +f switch for the dax modules'
    dynamic debug this results in double prints of the function name.

    Reported-by: Johannes Thumshirn
    Reported-by: Ross Zwisler
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

20 Jan, 2018

1 commit

  • If a dax buffer from a device that does not map pages is passed to
    read(2) or write(2) as a target for direct-I/O it triggers SIGBUS. If
    gdb attempts to examine the contents of a dax buffer from a device that
    does not map pages it triggers SIGBUS. If fork(2) is called on a process
    with a dax mapping from a device that does not map pages it triggers
    SIGBUS. 'struct page' is required otherwise several kernel code paths
    break in surprising ways. Disable filesystem-dax on devices that do not
    map pages.

    In addition to needing pfn_to_page() to be valid we also require devmap
    pages. We need this to detect dax pages in the get_user_pages_fast()
    path and so that we can stop managing the VM_MIXEDMAP flag. For DAX
    drivers that have not supported get_user_pages() to date we allow them
    to opt-in to supporting DAX with the CONFIG_FS_DAX_LIMITED configuration
    option which requires ->direct_access() to return pfn_t_special() pfns.
    This leaves DAX support in brd disabled and scheduled for removal.
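
    A hedged sketch of the resulting pfn validation when probing a dax
    device; pfn_t_devmap() and pfn_t_special() are the real pfn_t
    accessors, while the control flow is illustrative:

    bool dax_enabled = false;

    if (pfn_t_devmap(pfn))
            dax_enabled = true;     /* struct page + gup support */
    else if (pfn_t_special(pfn) && IS_ENABLED(CONFIG_FS_DAX_LIMITED))
            dax_enabled = true;     /* e.g. s390 dcssblk, no gup */

    if (!dax_enabled)
            return false;           /* filesystem-dax stays disabled */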

    Note that when the initial dax support was being merged a few years back
    there was concern that struct page was unsuitable for use with next
    generation persistent memory devices. The theoretical concern was that
    struct page access, being such a hotly used data structure in the
    kernel, would lead to media wear-out. While that was a reasonable,
    conservative starting position, it has not held true in practice. We have
    long since committed to using devm_memremap_pages() to support higher
    order kernel functionality that needs get_user_pages() and
    pfn_to_page().

    Cc: Jeff Moyer
    Cc: Ross Zwisler
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Gerald Schaefer
    Signed-off-by: Dan Williams

    Dan Williams
     

09 Jan, 2018

1 commit

  • This new interface is similar to how struct device (and many others)
    work. The caller initializes a 'struct dev_pagemap' as required
    and calls 'devm_memremap_pages'. This allows the pagemap structure to
    be embedded in another structure and thus container_of can be used. In
    this way application-specific members can be stored in a containing
    struct.

    This will be used by the P2P infrastructure, and HMM could probably
    be cleaned up to use it as well (instead of having its own, similar
    'hmm_devmem_pages_create' function).
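
    A hedged sketch of the new convention with an illustrative driver
    structure (my_pmem_dev and its members are stand-ins):

    struct my_pmem_dev {                    /* illustrative container */
            struct dev_pagemap pgmap;       /* embedded, caller-initialized */
            struct completion cmp;          /* application-specific member */
    };

    static struct my_pmem_dev *to_my_pmem_dev(struct dev_pagemap *pgmap)
    {
            return container_of(pgmap, struct my_pmem_dev, pgmap);
    }

    /* probe path: initialize the embedded pagemap, then map it */
    pmem->pgmap.res = *res;
    addr = devm_memremap_pages(dev, &pmem->pgmap);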

    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Christoph Hellwig
     

30 Nov, 2017

1 commit

  • Similar to how device-dax enforces that the 'address', 'offset', and
    'len' parameters to mmap() be aligned to the device's fundamental
    alignment, the same constraints apply to munmap(). Implement ->split()
    to fail munmap calls that violate the alignment constraint.

    Otherwise, we later fail VM_BUG_ON checks in the unmap_page_range() path
    with crash signatures of the form:

    vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000
    next (null) prev (null) mm ffff8800b61150c0
    prot 8000000000000027 anon_vma (null) vm_ops ffffffffa0091240
    pgoff 0 file ffff8800b638ef80 private_data (null)
    flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage)
    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:2014!
    [..]
    RIP: 0010:__split_huge_pud+0x12a/0x180
    [..]
    Call Trace:
    unmap_page_range+0x245/0xa40
    ? __vma_adjust+0x301/0x990
    unmap_vmas+0x4c/0xa0
    unmap_region+0xae/0x120
    ? __vma_rb_erase+0x11a/0x230
    do_munmap+0x276/0x410
    vm_munmap+0x6a/0xa0
    SyS_munmap+0x1d/0x30
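
    A hedged sketch of the handler, close to but not necessarily identical
    with the applied patch:

    static int dev_dax_split(struct vm_area_struct *vma, unsigned long addr)
    {
            struct dev_dax *dev_dax = vma->vm_file->private_data;

            /* refuse splits violating the fundamental alignment */
            if (!IS_ALIGNED(addr, dev_dax->region->align))
                    return -EINVAL;
            return 0;
    }

    static const struct vm_operations_struct dax_vm_ops = {
            .fault = dev_dax_fault,
            .split = dev_dax_split,
    };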

    Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
    Signed-off-by: Dan Williams
    Reported-by: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

18 Nov, 2017

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "Save for a few late fixes, all of these commits have shipped in -next
    releases since before the merge window opened, and 0day has given a
    build success notification.

    The ext4 touches came from Jan, and the xfs touches have Darrick's
    reviewed-by. An xfstest for the MAP_SYNC feature has been through
    a few rounds of review and is on track to be merged.

    - Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
    'userspace flush' of persistent memory updates via filesystem-dax
    mappings. It arranges for any filesystem metadata updates that may
    be required to satisfy a write fault to also be flushed ("on disk")
    before the kernel returns to userspace from the fault handler.
    Effectively every write-fault that dirties metadata completes an
    fsync() before returning from the fault handler. The new
    MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
    is validated as supported by the filesystem's ->mmap() file
    operation.

    - Add support for the standard ACPI 6.2 label access methods that
    replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
    This enables interoperability with environments that only implement
    the standardized methods.

    - Add support for the ACPI 6.2 NVDIMM media error injection methods.

    - Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
    latch last shutdown status, firmware update, SMART error injection,
    and SMART alarm threshold control.

    - Cleanup physical address information disclosures to be root-only.

    - Fix revalidation of the DIMM "locked label area" status to support
    dynamic unlock of the label area.

    - Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
    (system-physical-address) command and error injection commands.

    Acknowledgements that came after the commits were pushed to -next:

    - 957ac8c421ad ("dax: fix PMD faults on zero-length files"):
    Reviewed-by: Ross Zwisler

    - a39e596baa07 ("xfs: support for synchronous DAX faults") and
    7b565c9f965b ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
    Reviewed-by: Darrick J. Wong "

    * tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
    acpi, nfit: add 'Enable Latch System Shutdown Status' command support
    dax: fix general protection fault in dax_alloc_inode
    dax: fix PMD faults on zero-length files
    dax: stop requiring a live device for dax_flush()
    brd: remove dax support
    dax: quiet bdev_dax_supported()
    fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
    tools/testing/nvdimm: unit test clear-error commands
    acpi, nfit: validate commands against the device type
    tools/testing/nvdimm: stricter bounds checking for error injection commands
    xfs: support for synchronous DAX faults
    xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
    ext4: Support for synchronous DAX faults
    ext4: Simplify error handling in ext4_dax_huge_fault()
    dax: Implement dax_finish_sync_fault()
    dax, iomap: Add support for synchronous faults
    mm: Define MAP_SYNC and VM_SYNC flags
    dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
    dax: Allow dax_iomap_fault() to return pfn
    dax: Fix comment describing dax_iomap_fault()
    ...

    Linus Torvalds
     
