24 Aug, 2018

1 commit

  • [ Upstream commit 5a14e91d559aee5bdb0e002e1153fd9c4338a29e ]

    This is easily triggered from userspace, so let's ratelimit the
    messages.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Dan Williams
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
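
The rate limiting mentioned in this commit can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct model_ratelimit` and `model_ratelimit_ok()` are hypothetical names invented for illustration, and the window/burst policy is an assumed one, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a rate limiter; not the kernel's API. */
struct model_ratelimit {
    long interval;  /* window length, in arbitrary time units */
    int burst;      /* messages allowed per window */
    long begin;     /* start of the current window */
    int printed;    /* messages emitted in the current window */
    int missed;     /* messages suppressed so far */
};

/* Return 1 if a message may be emitted at time 'now', 0 if suppressed. */
int model_ratelimit_ok(struct model_ratelimit *rs, long now)
{
    assert(rs != NULL && rs->interval > 0 && rs->burst > 0);

    /* Start a fresh window once the previous one has expired. */
    if (now - rs->begin >= rs->interval) {
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    rs->missed++;
    return 0;
}
```

With a burst of 2 per window, a userspace-triggered flood would emit two messages and suppress the rest until the next window begins.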
     

11 Jul, 2018

3 commits

  • commit 15256f6cc4b44f2e70503758150267fd2a53c0d6 upstream.

    Add an explicit check for QUEUE_FLAG_DAX to __bdev_dax_supported(). This
    is needed for DM configurations where the first element in the dm-linear or
    dm-stripe target supports DAX, but other elements do not. Without this
    check __bdev_dax_supported() will pass for such devices, letting a
    filesystem on that device mount with the DAX option.

    Signed-off-by: Ross Zwisler
    Suggested-by: Mike Snitzer
    Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
    Cc: stable@vger.kernel.org
    Acked-by: Dan Williams
    Reviewed-by: Toshi Kani
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ross Zwisler
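
The failure mode described above (only the first element of a stacked device being consulted) can be sketched in plain C. This is a userspace model, not kernel code: `struct model_component` and both functions are hypothetical, and it assumes that a whole-device DAX flag should be true only when every component is DAX capable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one component of a stacked (linear/stripe) device. */
struct model_component {
    bool dax_capable;
};

/* Model of the problem: only the first component is consulted. */
bool model_first_only_supports_dax(const struct model_component *devs, size_t n)
{
    assert(devs != NULL || n == 0);
    return n > 0 && devs[0].dax_capable;
}

/* Model of the intended result: every component must be DAX capable. */
bool model_all_support_dax(const struct model_component *devs, size_t n)
{
    assert(devs != NULL || n == 0);
    if (n == 0)
        return false;
    for (size_t i = 0; i < n; i++)
        if (!devs[i].dax_capable)
            return false;
    return true;
}
```

For a two-element device whose second element lacks DAX, the first function reports support while the second does not, which is the gap the explicit flag check is meant to close.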
     
  • commit 80660f20252d6f76c9f203874ad7c7a4a8508cf8 upstream.

    The function's return values are confusing given how the function is
    named. We expect a true or false return value, but it actually returns
    0/-errno, which makes the calling code hard to follow. Change the
    function to return a bool: true if DAX is supported, false if it is
    not.

    Signed-off-by: Dave Jiang
    Signed-off-by: Ross Zwisler
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Dave Jiang
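
A minimal sketch of why the 0/-errno convention reads badly for a "supported" predicate. The function names are hypothetical and the specific errno value is an assumption chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Old convention (model): 0 means "supported", negative errno means "not". */
int model_dax_supported_errno(bool capable)
{
    return capable ? 0 : -EOPNOTSUPP;
}

/* New convention (model): the return type matches the predicate-style name. */
bool model_dax_supported(bool capable)
{
    return capable;
}

/* Demonstrates the inversion a reader has to keep in mind with the old style. */
void model_show_pitfall(void)
{
    /* Old style: a non-zero ("truthy") result means DAX is NOT supported. */
    assert(model_dax_supported_errno(false) != 0);
    assert(model_dax_supported_errno(true) == 0);

    /* New style: the result reads the way the name suggests. */
    assert(model_dax_supported(true));
    assert(!model_dax_supported(false));
}
```

With the old style, `if (model_dax_supported_errno(x))` takes the branch when DAX is unavailable, which is the opposite of what the name implies.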
     
  • commit ba23cba9b3bdc967aabdc6ff1e3e9b11ce05bb4f upstream.

    Change bdev_dax_supported so it takes a bdev parameter. This enables
    multi-device filesystems like xfs to check that a dax device can work for
    the particular filesystem. Once that's in place, actually fix all the
    parts of XFS where we need to be able to distinguish between datadev and
    rtdev.

    This patch fixes the problem where we screw up the dax support checking
    in xfs if the datadev and rtdev have different dax capabilities.

    Signed-off-by: Darrick J. Wong
    [rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
    Signed-off-by: Ross Zwisler
    Reviewed-by: Eric Sandeen
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     

20 Dec, 2017

1 commit

  • [ Upstream commit 0a3ff78699d1817e711441715d22665475466036 ]

    Fix this build warning:

    warning: 'phys' may be used uninitialized in this function
    [-Wuninitialized]

    As reported here:

    https://lkml.org/lkml/2017/10/16/152
    http://kisskb.ellerman.id.au/kisskb/buildresult/13181373/log/

    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ross Zwisler
     

05 Dec, 2017

1 commit

  • commit 9702cffdbf2129516db679e4467db81e1cd287da upstream.

    Similar to how device-dax enforces that the 'address', 'offset', and
    'len' parameters to mmap() be aligned to the device's fundamental
    alignment, the same constraints apply to munmap(). Implement ->split()
    to fail munmap calls that violate the alignment constraint.

    Otherwise, we later fail VM_BUG_ON checks in the unmap_page_range() path
    with crash signatures of the form:

    vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000
    next (null) prev (null) mm ffff8800b61150c0
    prot 8000000000000027 anon_vma (null) vm_ops ffffffffa0091240
    pgoff 0 file ffff8800b638ef80 private_data (null)
    flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage)
    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:2014!
    [..]
    RIP: 0010:__split_huge_pud+0x12a/0x180
    [..]
    Call Trace:
    unmap_page_range+0x245/0xa40
    ? __vma_adjust+0x301/0x990
    unmap_vmas+0x4c/0xa0
    unmap_region+0xae/0x120
    ? __vma_rb_erase+0x11a/0x230
    do_munmap+0x276/0x410
    vm_munmap+0x6a/0xa0
    SyS_munmap+0x1d/0x30

    Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
    Signed-off-by: Dan Williams
    Reported-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
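
The alignment rule enforced by the split hook can be illustrated with a small userspace model. `model_split()` is a hypothetical name, and the sketch assumes a power-of-two device alignment; it only mirrors the idea of rejecting an unaligned split address.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of a split check: refuse to split a mapping at an
 * address that is not a multiple of the device alignment.
 */
int model_split(uint64_t addr, uint64_t align)
{
    /* The mask test below is only valid for power-of-two alignments. */
    assert(align != 0 && (align & (align - 1)) == 0);

    if (addr & (align - 1))
        return -EINVAL;  /* unaligned: the unmap request must fail */
    return 0;            /* aligned: the split may proceed */
}
```

With a 2 MiB alignment, a split at a 2 MiB boundary succeeds, while one that is only 4 KiB aligned is rejected before it can reach the huge-page unmap path.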
     

30 Nov, 2017

1 commit

  • commit 9f586fff6574f6ecbf323f92d44ffaf0d96225fe upstream.

    Don't crash in case of allocation failure in dax_alloc_inode.

    syzkaller hit the following crash on e4880bc5dfb1

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    [..]
    RIP: 0010:dax_alloc_inode+0x3b/0x70 drivers/dax/super.c:348
    Call Trace:
    alloc_inode+0x65/0x180 fs/inode.c:208
    new_inode_pseudo+0x69/0x190 fs/inode.c:890
    new_inode+0x1c/0x40 fs/inode.c:919
    mount_pseudo_xattr+0x288/0x560 fs/libfs.c:261
    mount_pseudo include/linux/fs.h:2137 [inline]
    dax_mount+0x2e/0x40 drivers/dax/super.c:388
    mount_fs+0x66/0x2d0 fs/super.c:1223

    Fixes: 7b6be8444e0f ("dax: refactor dax-fs into a generic provider...")
    Reported-by: syzbot
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
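
The shape of this kind of fix, checking the allocation before touching the embedded inode, can be sketched in userspace. All names here are hypothetical; the allocator is passed in as a function pointer only so that failure can be simulated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an inode embedded in a larger device object. */
struct model_inode {
    int initialized;
};

struct model_dax_device {
    struct model_inode inode;  /* first member, so both share one address */
};

/* Allocator that always fails, used to simulate memory pressure. */
void *model_failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Return the embedded inode, or NULL if the allocation failed. */
struct model_inode *model_alloc_inode(void *(*alloc_fn)(size_t))
{
    struct model_dax_device *dax_dev;

    assert(alloc_fn != NULL);
    dax_dev = alloc_fn(sizeof(*dax_dev));
    if (!dax_dev)
        return NULL;  /* without this check, the next line would crash */

    dax_dev->inode.initialized = 1;
    return &dax_dev->inode;
}
```

Calling `model_alloc_inode(model_failing_alloc)` returns NULL instead of dereferencing a failed allocation.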
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

15 Sep, 2017

1 commit

  • …/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - Some request-based DM core and DM multipath fixes and cleanups

    - Constify a few variables in DM core and DM integrity

    - Add bufio optimization and checksum failure accounting to DM
    integrity

    - Fix DM integrity to avoid checking integrity of failed reads

    - Fix DM integrity to use init_completion

    - A couple DM log-writes target fixes

    - Simplify DAX flushing by eliminating the unnecessary flush
    abstraction that was stood up for DM's use.

    * tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dax: remove the pmem_dax_ops->flush abstraction
    dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
    dm integrity: make blk_integrity_profile structure const
    dm integrity: do not check integrity for failed read operations
    dm log writes: fix >512b sectorsize support
    dm log writes: don't use all the cpu while waiting to log blocks
    dm ioctl: constify ioctl lookup table
    dm: constify argument arrays
    dm integrity: count and display checksum failures
    dm integrity: optimize writing dm-bufio buffers that are partially changed
    dm rq: do not update rq partially in each ending bio
    dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
    dm mpath: complain about unsupported __multipath_map_bio() return values
    dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through

    Linus Torvalds
     

11 Sep, 2017

1 commit

  • Commit abebfbe2f731 ("dm: add ->flush() dax operation support") is
    buggy. A DM device may be composed of multiple underlying devices and
    all of them need to be flushed. That commit just routes the flush
    request to the first device and ignores the other devices.

    It could be fixed by adding more complex logic to the device mapper. But
    there is only one implementation of the method pmem_dax_ops->flush - that
    is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
    don't need the pmem_dax_ops->flush abstraction at all, we can call
    arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
    can't ever reach anything different from arch_wb_cache_pmem().

    It should also be pointed out that some uses of persistent memory need
    to flush only a very small amount of data (such as 1 cacheline), and it
    would be overkill to go through the device mapper machinery for a
    single flushed cache line.

    Fix this by removing the pmem_dax_ops->flush abstraction and call
    arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
    mapper code that forwards the flushes.

    Fixes: abebfbe2f731 ("dm: add ->flush() dax operation support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Reviewed-by: Dan Williams
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     

04 Sep, 2017

1 commit

  • The 0day kbuild robot reports:

    >> drivers//dax/super.c:64:20: error: redefinition of 'fs_dax_get_by_bdev'
    struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
    ^~~~~~~~~~~~~~~~~~
    In file included from drivers//dax/super.c:22:0:
    include/linux/dax.h:76:34: note: previous definition of 'fs_dax_get_by_bdev' was here
    static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
    ^~~~~~~~~~~~~~~~~~

    Protect the definition of fs_dax_get_by_bdev() in drivers/dax/super.c
    with an ifdef.

    Fixes: 78f354735081 ("dax: introduce a fs_dax_get_by_bdev() helper")
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Reviewed-by: Jan Kara
    Reported-by: kbuild test robot
    Signed-off-by: Dan Williams

    Dan Williams
     

31 Aug, 2017

1 commit

  • Add a helper that can replace the following common pattern:

    if (blk_queue_dax(bdev->bd_queue))
    fs_dax_get_by_host(bdev->bd_disk->disk_name);

    This will be used to move dax_device lookup from iomap-operation time to
    fs-mount time.

    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dan Williams

    Dan Williams
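
The two-step pattern the helper replaces (check the queue's DAX capability, then look the device up by disk name) can be modeled as follows. This is a userspace sketch: the structures, the lookup table, and the `model_*` functions are hypothetical and only mimic the control flow quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a dax device and a block device. */
struct model_dax_device {
    const char *host;
};

struct model_bdev {
    bool queue_dax;         /* models the queue's DAX capability */
    const char *disk_name;  /* models the disk name used as lookup key */
};

static struct model_dax_device model_table[] = { { "pmem0" }, { "pmem1" } };

/* Look a device up by host name; NULL if there is no match. */
struct model_dax_device *model_get_by_host(const char *host)
{
    for (size_t i = 0; i < sizeof(model_table) / sizeof(model_table[0]); i++)
        if (strcmp(model_table[i].host, host) == 0)
            return &model_table[i];
    return NULL;
}

/* The helper: fold the capability check and the lookup into one call. */
struct model_dax_device *model_get_by_bdev(const struct model_bdev *bdev)
{
    assert(bdev != NULL);
    if (!bdev->queue_dax)
        return NULL;
    return model_get_by_host(bdev->disk_name);
}
```

A caller can then do the lookup once, for example at mount time, and keep the returned pointer instead of repeating the check on every operation.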
     

29 Jul, 2017

1 commit

  • …evice-mapper/linux-dm

    Pull device mapper fixes from Mike Snitzer:

    - a few DM integrity fixes that improve performance. One addresses
    inefficiencies in the on-disk journal device layout. Another makes
    use of the block layer's on-stack plugging when writing the
    journal.

    - a dm-bufio fix for the blk_status_t conversion that went in during
    the merge window.

    - a few DM raid fixes that address correctness when suspending the
    device and a fix for the validation that occurs during device
    activation.

    - a couple DM zoned target fixes. Important one being the fix to not
    use GFP_KERNEL in the IO path due to concerns about deadlock in
    low-memory conditions (e.g. swap over a DM zoned device, etc).

    - a DM DAX device fix to make sure dm_dax_flush() is called if the
    underlying DAX device is operating as a write cache.

    * tag 'for-4.13/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm, dax: Make sure dm_dax_flush() is called if device supports it
    dm verity fec: fix GFP flags used with mempool_alloc()
    dm zoned: use GFP_NOIO in I/O path
    dm zoned: remove test for impossible REQ_OP_FLUSH conditions
    dm raid: bump target version
    dm raid: avoid mddev->suspended access
    dm raid: fix activation check in validate_raid_redundancy()
    dm raid: remove WARN_ON() in raid10_md_layout_to_format()
    dm bufio: fix error code in dm_bufio_write_dirty_buffers()
    dm integrity: test for corrupted disk format during table load
    dm integrity: WARN_ON if variables representing journal usage get out of sync
    dm integrity: use plugging when writing the journal
    dm integrity: fix inefficient allocation of journal space

    Linus Torvalds
     

27 Jul, 2017

1 commit

  • Currently dm_dax_flush() is not being called, even if the underlying
    dax device supports write cache, because DAXDEV_WRITE_CACHE is not
    being propagated up to the DM dax device.

    If the underlying dax device supports write cache, set
    DAXDEV_WRITE_CACHE on the DM dax device. This will cause dm_dax_flush()
    to be called.

    Fixes: abebfbe2f7 ("dm: add ->flush() dax operation support")
    Signed-off-by: Vivek Goyal
    Acked-by: Dan Williams
    Signed-off-by: Mike Snitzer

    Vivek Goyal
     

19 Jul, 2017

1 commit

  • Fix warnings of the form...

    WARNING: CPU: 10 PID: 4983 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
    sysfs: cannot create duplicate filename '/class/dax/dax12.0'
    Call Trace:
    dump_stack+0x63/0x86
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x5a/0x80
    ? kernfs_path_from_node+0x4f/0x60
    sysfs_warn_dup+0x62/0x80
    sysfs_do_create_link_sd.isra.2+0x97/0xb0
    sysfs_create_link+0x25/0x40
    device_add+0x266/0x630
    devm_create_dax_dev+0x2cf/0x340 [dax]
    dax_pmem_probe+0x1f5/0x26e [dax_pmem]
    nvdimm_bus_probe+0x71/0x120

    ...by reusing the namespace id for the device-dax instance name.

    Now that we have decided that there will never be more than one
    device-dax instance per libnvdimm-namespace parent device [1], we can
    directly reuse the namespace ids. There are some possible follow-on
    cleanups, but those are saved for a later patch to simplify the -stable
    backport.

    [1]: https://lists.01.org/pipermail/linux-nvdimm/2016-December/008266.html

    Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem...")
    Cc: Jeff Moyer
    Reported-by: Dariusz Dokupil
    Signed-off-by: Dan Williams

    Dan Williams
     

18 Jul, 2017

1 commit

  • Dan Carpenter reports:

    The patch 7b6be8444e0f: "dax: refactor dax-fs into a generic provider
    of 'struct dax_device' instances" from Apr 11, 2017, leads to the
    following static checker warning:

    drivers/dax/device.c:643 devm_create_dev_dax()
    warn: passing zero to 'ERR_PTR'

    Fix the case where we inadvertently leak 0 to ERR_PTR() by setting at
    every error case, and make it clear that 'count' is never 0.

    Reported-by: Dan Carpenter
    Signed-off-by: Dan Williams

    Dan Williams
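
Why passing zero to an error-pointer encoder is a problem can be shown with a userspace model of the convention. The `model_*` helpers below are hypothetical re-creations written for illustration (error codes encoded as the top 4095 pointer values); they are not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value (model of the convention). */
void *model_err_ptr(long err)
{
    assert(err <= 0 && err >= -MODEL_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* True if the pointer value lies in the reserved error range. */
int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

static int model_object;

/* Buggy shape: an error path that leaves rc at 0 returns plain NULL. */
void *model_create_buggy(int step_fails)
{
    int rc = 0;

    if (step_fails)
        return model_err_ptr(rc);  /* bug: encodes 0, which is just NULL */
    return &model_object;
}

/* Fixed shape: every error path carries an explicit negative errno. */
void *model_create_fixed(int step_fails)
{
    if (step_fails)
        return model_err_ptr(-ENOMEM);
    return &model_object;
}
```

A caller that only checks `model_is_err()` treats the buggy function's NULL result as success and may dereference it, which is why each error path needs a non-zero code.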
     

08 Jul, 2017

2 commits

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:
    "libnvdimm updates for the latest ACPI and UEFI specifications. This
    pull request also includes the new 'struct dax_operations' enabling,
    which undoes the abuse of copy_user_nocache() for copy operations to pmem.

    The dax work originally missed 4.12 to address concerns raised by Al.

    Summary:

    - Introduce the _flushcache() family of memory copy helpers and use
    them for persistent memory write operations on x86. The
    _flushcache() semantic indicates that the cache is either bypassed
    for the copy operation (movnt) or any lines dirtied by the copy
    operation are written back (clwb, clflushopt, or clflush).

    - Extend dax_operations with ->copy_from_iter() and ->flush()
    operations. These operations and other infrastructure updates allow
    all persistent memory specific dax functionality to be pushed into
    libnvdimm and the pmem driver directly. It also allows dax-specific
    sysfs attributes to be linked to a host device, for example:
    /sys/block/pmem0/dax/write_cache

    - Add support for the new NVDIMM platform/firmware mechanisms
    introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
    namespace label format, extensions to the address-range-scrub
    command set, new error injection commands, and a new BTT
    (block-translation-table) layout. These updates support inter-OS
    and pre-OS compatibility.

    - Fix a longstanding memory corruption bug in nfit_test.

    - Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
    capable.

    - Miscellaneous fixes and small updates across libnvdimm and the nfit
    driver.

    Acknowledgements that came after the branch was pushed: commit
    6aa734a2f38e ("libnvdimm, region, pmem: fix 'badblocks'
    sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
    "

    * tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
    libnvdimm, namespace: record 'lbasize' for pmem namespaces
    acpi/nfit: Issue Start ARS to retrieve existing records
    libnvdimm: New ACPI 6.2 DSM functions
    acpi, nfit: Show bus_dsm_mask in sysfs
    libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
    acpi, nfit: Enable DSM pass thru for root functions.
    libnvdimm: passthru functions clear to send
    libnvdimm, btt: convert some info messages to warn/err
    libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
    libnvdimm: fix the clear-error check in nsio_rw_bytes
    libnvdimm, btt: fix btt_rw_page not returning errors
    acpi, nfit: quiet invalid block-aperture-region warnings
    libnvdimm, btt: BTT updates for UEFI 2.7 format
    acpi, nfit: constify *_attribute_group
    libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
    libnvdimm, pmem, dax: export a cache control attribute
    dax: convert to bitmask for flags
    dax: remove default copy_from_iter fallback
    libnvdimm, nfit: enable support for volatile ranges
    libnvdimm, pmem: fix persistence warning
    ...

    Linus Torvalds
     

06 Jul, 2017

1 commit

  • Most filesystems currently use mapping_set_error and
    filemap_check_errors for setting and reporting/clearing writeback errors
    at the mapping level. filemap_check_errors is indirectly called from
    most of the filemap_fdatawait_* functions and from
    filemap_write_and_wait*. These functions are called from all sorts of
    contexts to wait on writeback to finish -- e.g. mostly in fsync, but
    also in truncate calls, getattr, etc.

    The non-fsync callers are problematic. We should be reporting writeback
    errors during fsync, but many places spread over the tree clear out
    errors before they can be properly reported, or report errors at
    nonsensical times.

    If I get -EIO on a stat() call, there is no reason for me to assume that
    it is because some previous writeback failed. The fact that it also
    clears out the error such that a subsequent fsync returns 0 is a bug,
    and a nasty one since that's potentially silent data corruption.

    This patch adds a small bit of new infrastructure for setting and
    reporting errors during address_space writeback. While the above was my
    original impetus for adding this, I think it's also the case that
    current fsync semantics are just problematic for userland. Most
    applications that call fsync do so to ensure that the data they wrote
    has hit the backing store.

    In the case where there are multiple writers to the file at the same
    time, this is really hard to determine. The first one to call fsync will
    see any stored error, and the rest get back 0. The processes with open
    fds may not be associated with one another in any way. They could even
    be in different containers, so ensuring coordination between all fsync
    callers is not really an option.

    One way to remedy this would be to track what file descriptor was used
    to dirty the file, but that's rather cumbersome and would likely be
    slow. However, there is a simpler way to improve the semantics here
    without incurring too much overhead.

    This set adds an errseq_t to struct address_space, and a corresponding
    one is added to struct file. Writeback errors are recorded in the
    mapping's errseq_t, and the one in struct file is used as the "since"
    value.

    This changes the semantics of the Linux fsync implementation such that
    applications can now use it to determine whether there were any
    writeback errors since fsync(fd) was last called (or since the file was
    opened in the case of fsync having never been called).

    Note that those writeback errors may have occurred when writing data
    that was dirtied via an entirely different fd, but that's the case now
    with the current mapping_set_error/filemap_check_error infrastructure.
    This will at least prevent you from getting a false report of success.

    The new behavior is still consistent with the POSIX spec, and is more
    reliable for application developers. This patch just adds some basic
    infrastructure for doing this, and ensures that the f_wb_err "cursor"
    is properly set when a file is opened. Later patches will change the
    existing code to use this new infrastructure for reporting errors at
    fsync time.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
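
The "since" cursor idea described above can be sketched with a deliberately simplified userspace model. This is not the errseq_t implementation: it assumes a plain counter plus a stored error code held in separate fields, and every `model_*` name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified model: the mapping records the latest error and a sequence. */
struct model_mapping {
    unsigned int seq;  /* bumped on every recorded writeback error */
    int err;           /* most recent error, as a negative errno */
};

/* Each open file keeps its own cursor into the mapping's sequence. */
struct model_file {
    struct model_mapping *mapping;
    unsigned int since;
};

/* Record a writeback error on the mapping. */
void model_set_error(struct model_mapping *m, int err)
{
    assert(m != NULL && err < 0);
    m->err = err;
    m->seq++;
}

/* Opening a file samples the current sequence as its starting point. */
void model_open(struct model_file *f, struct model_mapping *m)
{
    assert(f != NULL && m != NULL);
    f->mapping = m;
    f->since = m->seq;
}

/* Report an error once per file if one occurred since the last check. */
int model_fsync(struct model_file *f)
{
    assert(f != NULL && f->mapping != NULL);
    if (f->since == f->mapping->seq)
        return 0;
    f->since = f->mapping->seq;  /* advance this file's cursor only */
    return f->mapping->err;
}
```

Because each file advances only its own cursor, two files open at the time of the error both see it on their next `model_fsync()`, unlike the flag-based scheme where the first caller clears it for everyone.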
     

30 Jun, 2017

2 commits

  • The dax_flush() operation can be turned into a nop on platforms where
    firmware arranges for cpu caches to be flushed on a power-fail event.
    The ACPI 6.2 specification defines a mechanism for the platform to
    indicate this capability so the kernel can select the proper default.
    However, for other platforms, the administrator must toggle this setting
    manually.

    Given this flush setting is a dax-specific mechanism we advertise it
    through a 'dax' attribute group hanging off a host device. For example,
    a 'pmem0' block-device gets a 'dax' sysfs-subdirectory with a
    'write_cache' attribute to control response to dax cache flush requests.
    This is similar to the 'queue/write_cache' attribute that appears under
    block devices.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Suggested-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In preparation for adding more flags, convert the existing flag to a
    bit-flag.

    Signed-off-by: Dan Williams

    Dan Williams
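
A sketch of what converting a single flag into a bit-flag field can look like, assuming the existing flag was a boolean-style field. The enum values and helpers are hypothetical userspace models; only the general idea of numbered bits in one flags word is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit numbers; new flags are added by extending the enum. */
enum model_dax_flags {
    MODEL_DAXDEV_ALIVE,
    MODEL_DAXDEV_WRITE_CACHE,
};

struct model_dax_device {
    unsigned long flags;  /* one bit per flag instead of one field per flag */
};

void model_set_flag(struct model_dax_device *d, enum model_dax_flags bit)
{
    assert(d != NULL);
    d->flags |= 1UL << bit;
}

void model_clear_flag(struct model_dax_device *d, enum model_dax_flags bit)
{
    assert(d != NULL);
    d->flags &= ~(1UL << bit);
}

bool model_test_flag(const struct model_dax_device *d, enum model_dax_flags bit)
{
    assert(d != NULL);
    return (d->flags >> bit) & 1UL;
}
```

Setting or clearing one flag leaves the others untouched, so further flags can be added without growing the structure.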
     

28 Jun, 2017

1 commit

  • Require all dax-drivers to register a ->copy_from_iter() operation so
    that it is clear which dax_operations are optional and which must be
    implemented for filesystem-dax to operate.

    Cc: Gerald Schaefer
    Suggested-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

16 Jun, 2017

1 commit

  • Allow device-mapper to route flush operations to the
    per-target implementation. In order for the device stacking to work we
    need a dax_dev and a pgoff relative to that device. This gives each
    layer of the stack the information it needs to look up the operation
    pointer for the next level.

    This conceptually allows for an array of mixed device drivers with
    varying flush implementations.

    Reviewed-by: Toshi Kani
    Reviewed-by: Mike Snitzer
    Signed-off-by: Dan Williams

    Dan Williams
     

10 Jun, 2017

1 commit

  • Allow device-mapper to route copy_from_iter operations to the
    per-target implementation. In order for the device stacking to work we
    need a dax_dev and a pgoff relative to that device. This gives each
    layer of the stack the information it needs to look up the operation
    pointer for the next level.

    This conceptually allows for an array of mixed device drivers with
    varying copy_from_iter implementations.

    Reviewed-by: Toshi Kani
    Reviewed-by: Mike Snitzer
    Signed-off-by: Dan Williams

    Dan Williams
     

09 Jun, 2017

1 commit

  • The inode destruction path for the 'dax' device filesystem incorrectly
    assumes that the inode was initialized through 'alloc_dax()'. However,
    if someone attempts to directly mount the dax filesystem with 'mount -t
    dax dax mnt' that will bypass 'alloc_dax()' and the following failure
    signatures may occur as a result:

    kill_dax() must be called before final iput()
    WARNING: CPU: 2 PID: 1188 at drivers/dax/super.c:243 dax_destroy_inode+0x48/0x50
    RIP: 0010:dax_destroy_inode+0x48/0x50
    Call Trace:
    destroy_inode+0x3b/0x60
    evict+0x139/0x1c0
    iput+0x1f9/0x2d0
    dentry_unlink_inode+0xc3/0x160
    __dentry_kill+0xcf/0x180
    ? dput+0x37/0x3b0
    dput+0x3a3/0x3b0
    do_one_tree+0x36/0x40
    shrink_dcache_for_umount+0x2d/0x90
    generic_shutdown_super+0x1f/0x120
    kill_anon_super+0x12/0x20
    deactivate_locked_super+0x43/0x70
    deactivate_super+0x4e/0x60

    general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
    RIP: 0010:kfree+0x6d/0x290
    Call Trace:

    dax_i_callback+0x22/0x60
    ? dax_destroy_inode+0x50/0x50
    rcu_process_callbacks+0x298/0x740

    ida_remove called for id=0 which is not allocated.
    WARNING: CPU: 0 PID: 0 at lib/idr.c:383 ida_remove+0x110/0x120
    [..]
    Call Trace:

    ida_simple_remove+0x2b/0x50
    ? dax_destroy_inode+0x50/0x50
    dax_i_callback+0x3c/0x60
    rcu_process_callbacks+0x298/0x740

    Add missing initialization of the 'struct dax_device' and inode so that
    the destruction path does not kfree() or ida_simple_remove()
    uninitialized data.

    Fixes: 7b6be8444e0f ("dax: refactor dax-fs into a generic provider of 'struct dax_device' instances")
    Reported-by: Sasha Levin
    Signed-off-by: Dan Williams

    Dan Williams
     

14 May, 2017

1 commit

  • In the BLOCK=n case the dax core does not need to / must not emit the
    block-device-dax helpers. Otherwise it leads to compile errors.

    Cc: Arnd Bergmann
    Reported-by: Fabian Frederick
    Fixes: ef51042472f5 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
    Signed-off-by: Dan Williams

    Dan Williams
     

13 May, 2017

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

10 May, 2017

1 commit

  • There is no point in asking how many device-dax instances the kernel should
    support. Since we are already using a dynamic major number, just allow
    the max number of minors by default and be done. This also fixes the
    fact that the proposed max for the NR_DEV_DAX range was larger than what
    could be supported by alloc_chrdev_region().
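
    To make the "max number of minors" concrete, here is a small userspace
    sketch of the dev_t arithmetic. MINORBITS and MINORMASK mirror the
    definitions in include/linux/kdev_t.h; sketch_mkdev, max_minors and
    minor_range_fits are hypothetical helpers written for this example and
    are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as the kernel's dev_t: 12 bits of major, 20 bits of minor. */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)

/* Largest number of minors a single major number can cover. */
static inline uint32_t max_minors(void)
{
	return MINORMASK + 1;
}

/* Pack a major/minor pair the way the kernel's MKDEV() macro does. */
static inline uint32_t sketch_mkdev(uint32_t major, uint32_t minor)
{
	assert(minor <= MINORMASK);
	return (major << MINORBITS) | minor;
}

/* Hypothetical check: does a range of 'count' minors starting at 'first' fit? */
static inline int minor_range_fits(uint32_t first, uint32_t count)
{
	return count != 0 && first <= MINORMASK &&
	       count <= max_minors() - first;
}
```

    With 20 minor bits, a request for more than 1 << 20 minors under one
    major cannot be represented, which is why a fixed upper bound works
    better than a user-tunable one.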

    Fixes: ba09c01d2fa8 ("dax: convert to the cdev api")
    Reported-by: Geert Uytterhoeven
    Tested-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Dan Williams
     

09 May, 2017

2 commits

  • For configurations that do not enable DAX filesystems or drivers, do not
    require the DAX core to be built.

    Given that the 'direct_access' method has been removed from
    'block_device_operations', we can also go ahead and move the
    block-related dax helper functions from fs/block_dev.c to
    drivers/dax/super.c. This keeps dax details out of the block layer and
    lets the DAX core be built as a module in the FS_DAX=n case.

    Filesystems need to include dax.h to call bdev_dax_supported().

    Cc: linux-xfs@vger.kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Dan Williams
     
  • ERROR: "devm_create_dev_dax" [drivers/dax/dax_pmem.ko] undefined!
    ERROR: "alloc_dax_region" [drivers/dax/dax_pmem.ko] undefined!
    ERROR: "dax_region_put" [drivers/dax/dax_pmem.ko] undefined!

    Signed-off-by: Mike Galbraith
    Signed-off-by: Dan Williams

    Mike Galbraith
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

05 May, 2017

2 commits

  • Dan Williams
     
  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of new char/misc drivers and features for
    4.12-rc1.

    There's lots of new drivers added this time around, new firmware
    drivers from Google, more auxdisplay drivers, extcon drivers, fpga
    drivers, and a bunch of other driver updates. Nothing major, except if
    you happen to have the hardware for these drivers, and then you will
    be happy :)

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
    firmware: google memconsole: Fix return value check in platform_memconsole_init()
    firmware: Google VPD: Fix return value check in vpd_platform_init()
    goldfish_pipe: fix build warning about using too much stack.
    goldfish_pipe: An implementation of more parallel pipe
    fpga fr br: update supported version numbers
    fpga: region: release FPGA region reference in error path
    fpga altera-hps2fpga: disable/unprepare clock on error in alt_fpga_bridge_probe()
    mei: drop the TODO from samples
    firmware: Google VPD sysfs driver
    firmware: Google VPD: import lib_vpd source files
    misc: lkdtm: Add volatile to intentional NULL pointer reference
    eeprom: idt_89hpesx: Add OF device ID table
    misc: ds1682: Add OF device ID table
    misc: tsl2550: Add OF device ID table
    w1: Remove unneeded use of assert() and remove w1_log.h
    w1: Use kernel common min() implementation
    uio_mf624: Align memory regions to page size and set correct offsets
    uio_mf624: Refactor memory info initialization
    uio: Allow handling of non page-aligned memory regions
    hangcheck-timer: Fix typo in comment
    ...

    Linus Torvalds
     

02 May, 2017

2 commits

  • Pull x86 mm updates from Ingo Molnar:
    "The main x86 MM changes in this cycle were:

    - continued native kernel PCID support preparation patches to the TLB
    flushing code (Andy Lutomirski)

    - various fixes for 32-bit compat syscalls returning addresses above
    4 GB in applications launched from 64-bit binaries - motivated
    by C/R frameworks such as Virtuozzo. (Dmitry Safonov)

    - continued Intel 5-level paging enablement: in particular the
    conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)

    - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)

    - ... plus misc updates, fixes and cleanups"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
    mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
    x86/mm: Fix flush_tlb_page() on Xen
    x86/mm: Make flush_tlb_mm_range() more predictable
    x86/mm: Remove flush_tlb() and flush_tlb_current_task()
    x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
    x86/mm/64: Fix crash in remove_pagetable()
    Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
    x86/boot/e820: Remove a redundant self assignment
    x86/mm: Fix dump pagetables for 4 levels of page tables
    x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
    x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
    Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
    x86/espfix: Add support for 5-level paging
    x86/kasan: Extend KASAN to support 5-level paging
    x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
    x86/paravirt: Add 5-level support to the paravirt code
    x86/mm: Define virtual memory map for 5-level paging
    x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
    x86/boot: Detect 5-level paging support
    x86/mm/numa: Remove numa_nodemask_from_meminfo()
    ...

    Linus Torvalds
     
  • Usage of device_lock() for dax_region attributes is unnecessary and
    deadlock prone. It's unnecessary because the order of registration /
    un-registration guarantees that drvdata is always valid. It's deadlock
    prone because it sets up this situation:

    ndctl D 0 2170 2082 0x00000000
    Call Trace:
    __schedule+0x31f/0x980
    schedule+0x3d/0x90
    schedule_preempt_disabled+0x15/0x20
    __mutex_lock+0x402/0x980
    ? __mutex_lock+0x158/0x980
    ? align_show+0x2b/0x80 [dax]
    ? kernfs_seq_start+0x2f/0x90
    mutex_lock_nested+0x1b/0x20
    align_show+0x2b/0x80 [dax]
    dev_attr_show+0x20/0x50

    ndctl D 0 2186 2079 0x00000000
    Call Trace:
    __schedule+0x31f/0x980
    schedule+0x3d/0x90
    __kernfs_remove+0x1f6/0x340
    ? kernfs_remove_by_name_ns+0x45/0xa0
    ? remove_wait_queue+0x70/0x70
    kernfs_remove_by_name_ns+0x45/0xa0
    remove_files.isra.1+0x35/0x70
    sysfs_remove_group+0x44/0x90
    sysfs_remove_groups+0x2e/0x50
    dax_region_unregister+0x25/0x40 [dax]
    devm_action_release+0xf/0x20
    release_nodes+0x16d/0x2b0
    devres_release_all+0x3c/0x60
    device_release_driver_internal+0x17d/0x220
    device_release_driver+0x12/0x20
    unbind_store+0x112/0x160

    ndctl/2170 is trying to acquire the device_lock() to read an attribute,
    and ndctl/2186 is holding the device_lock() while trying to drain all
    active attribute readers.

    Thanks to Yi Zhang for the reproduction script.

    Fixes: d7fe1a67f658 ("dax: add region 'id', 'size', and 'align' attributes")
    Reported-by: Yi Zhang
    Signed-off-by: Dan Williams

    Dan Williams
     

01 May, 2017

1 commit

  • The x86 conversion to the generic GUP code included a small change which causes
    crashes and data corruption in the pmem code - not good.

    The root cause is that the /dev/pmem driver code implicitly relies on the x86
    get_user_pages() implementation doing a get_page() on the page refcount, because
    get_page() does a get_zone_device_page() which properly refcounts pmem's separate
    page struct arrays that are not present in the regular page struct structures.
    (The pmem driver does this because it can cover huge memory areas.)

    But the x86 conversion to the generic GUP code changed the get_page() to
    page_cache_get_speculative() which is faster but doesn't do the
    get_zone_device_page() call the pmem code relies on.

    One way to solve the regression would be to change the generic GUP code to use
    get_page(), but that would slow things down a bit and punish other generic-GUP
    using architectures for an x86-ism they did not care about. (Arguably the pmem
    driver was probably not working reliably for them: but nvdimm is an Intel
    feature, so non-x86 exposure is probably still limited.)

    So restructure the pmem code's interface with the MM instead: get rid of the
    get/put_zone_device_page() distinction, integrate put_zone_device_page() into
    __put_page() and restructure the pmem completion-wait and teardown machinery:

    Kirill points out that the calls to {get,put}_dev_pagemap() can be
    removed from the mm fast path if we take a single get_dev_pagemap()
    reference to signify that the page is alive and use the final put of the
    page to drop that reference.

    This does require some care to make sure that any waits for the
    percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
    since it now maintains its own elevated reference.

    This speeds up things while also making the pmem refcounting more robust going
    forward.

    Suggested-by: Kirill Shutemov
    Tested-by: Kirill Shutemov
    Signed-off-by: Dan Williams
    Reviewed-by: Logan Gunthorpe
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Jérôme Glisse
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar

    Dan Williams
     

21 Apr, 2017

1 commit

  • Replace bdev_direct_access() with dax_direct_access() that uses
    dax_device and dax_operations instead of a block_device and
    block_device_operations for dax. Once all consumers of the old api have
    been converted bdev_direct_access() will be deleted.

    Given that block device partitioning decisions can cause dax page
    alignment constraints to be violated this also introduces the
    bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
    to the dax_device and also checks for page alignment.
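
    The offset calculation and alignment check can be sketched in userspace
    as follows. This is a simplified model, not the kernel helper:
    sketch_dax_pgoff and its part_start_sect parameter (the partition's
    first sector on the whole device) are hypothetical, and 512-byte
    sectors and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512u   /* assumption: 512-byte sectors */
#define SKETCH_PAGE_SIZE   4096u  /* assumption: 4 KiB pages */

/*
 * Translate a partition-relative sector into a page offset relative to
 * the start of the whole device. Returns 0 on success, or -EINVAL when
 * the byte offset or the size is not a multiple of the page size.
 */
static int sketch_dax_pgoff(uint64_t part_start_sect, uint64_t sector,
			    size_t size, uint64_t *pgoff)
{
	uint64_t byte_off = (part_start_sect + sector) * SKETCH_SECTOR_SIZE;

	assert(pgoff != NULL);
	*pgoff = byte_off / SKETCH_PAGE_SIZE;

	if (byte_off % SKETCH_PAGE_SIZE || size % SKETCH_PAGE_SIZE)
		return -EINVAL;
	return 0;
}
```

    A partition starting at sector 2048 is page aligned under these
    assumptions, while a legacy partition starting at sector 63 is not, so
    the same request succeeds on the first and is rejected on the second.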

    Signed-off-by: Dan Williams

    Dan Williams
     

20 Apr, 2017

3 commits

  • Set up a dax_device to have the same lifetime as the pmem block device
    and add a ->direct_access() method that is equivalent to
    pmem_direct_access(). Once fs/dax.c has been converted to use
    dax_operations the old pmem_direct_access() will be removed.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Track a set of dax_operations per dax_device that can be set at
    alloc_dax() time. These operations will be used to stop the abuse of
    block_device_operations for communicating dax capabilities to
    filesystems. It will also be used to replace the "pmem api" and move
    pmem-specific cache maintenance, and other dax-driver-specific
    filesystem-dax operations, to dax device methods. In particular this
    allows us to stop abusing __copy_user_nocache(), via memcpy_to_pmem(),
    with a driver specific replacement.

    This is a standalone introduction of the operations. Follow on patches
    convert each dax-driver and teach fs/dax.c to use ->direct_access() from
    dax_operations instead of block_device_operations.
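
    The idea of a per-device operations table fixed at allocation time can
    be modelled in userspace roughly as below. Every name here
    (sketch_dax_device, sketch_dax_operations, sketch_alloc_dax,
    sketch_dax_direct_access, and the toy_ram driver) is hypothetical, and
    the signatures and error codes are assumptions for illustration rather
    than the kernel's actual interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096  /* assumption: 4 KiB pages */

struct sketch_dax_device;

/* Hypothetical operations table that each driver supplies. */
struct sketch_dax_operations {
	/* Store the address for pgoff in *kaddr; return pages available. */
	long (*direct_access)(struct sketch_dax_device *dev,
			      unsigned long pgoff, long nr_pages,
			      void **kaddr);
};

struct sketch_dax_device {
	void *private;                            /* driver data */
	const struct sketch_dax_operations *ops;  /* set once at allocation */
};

static struct sketch_dax_device *
sketch_alloc_dax(void *private, const struct sketch_dax_operations *ops)
{
	struct sketch_dax_device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->private = private;
	dev->ops = ops;
	return dev;
}

/* Generic entry point: dispatch through the device's own operations. */
static long sketch_dax_direct_access(struct sketch_dax_device *dev,
				     unsigned long pgoff, long nr_pages,
				     void **kaddr)
{
	assert(dev != NULL);
	if (!dev->ops || !dev->ops->direct_access)
		return -EOPNOTSUPP;
	return dev->ops->direct_access(dev, pgoff, nr_pages, kaddr);
}

/* Toy RAM-backed driver used only to exercise the dispatch path. */
struct toy_ram {
	char *base;
	long nr_pages;
};

static long toy_ram_direct_access(struct sketch_dax_device *dev,
				  unsigned long pgoff, long nr_pages,
				  void **kaddr)
{
	struct toy_ram *ram = dev->private;
	long avail;

	if (pgoff >= (unsigned long)ram->nr_pages)
		return -ERANGE;
	avail = ram->nr_pages - (long)pgoff;
	*kaddr = ram->base + pgoff * TOY_PAGE_SIZE;
	return avail < nr_pages ? avail : nr_pages;
}

static const struct sketch_dax_operations toy_ram_ops = {
	.direct_access = toy_ram_direct_access,
};
```

    The caller only ever sees sketch_dax_direct_access(); which driver
    answers is decided by the table passed to sketch_alloc_dax(), so no
    block-layer structure needs to carry the method.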

    Suggested-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • For the current block_device based filesystem-dax path, we need a way
    for it to look up the dax_device associated with a block_device. Add a
    'host' property of a dax_device that can be used for this purpose. It is
    a free form string, but for a dax_device associated with a block device
    it is the bdev name.

    This is a stop-gap until filesystems are able to mount on a dax-inode
    directly.
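
    A lookup keyed on the 'host' string might look like the userspace
    sketch below. The registry, host_dax_dev, host_dax_register and
    host_dax_get_by_host are all hypothetical names, and the fixed-size
    table with a linear scan is an assumption chosen for brevity; it is
    not meant to reflect how the kernel stores its devices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HOST_DAX_MAX 8  /* arbitrary capacity for the sketch */

struct host_dax_dev {
	const char *host;  /* free-form; the bdev name for block-backed devices */
	int id;
};

static struct host_dax_dev *host_dax_registry[HOST_DAX_MAX];

/* Record a device so it can later be found by name. Returns 0 or -1. */
static int host_dax_register(struct host_dax_dev *dev)
{
	size_t i;

	assert(dev != NULL && dev->host != NULL);
	for (i = 0; i < HOST_DAX_MAX; i++) {
		if (!host_dax_registry[i]) {
			host_dax_registry[i] = dev;
			return 0;
		}
	}
	return -1;  /* table full */
}

/* Find the device whose host string matches, or NULL if there is none. */
static struct host_dax_dev *host_dax_get_by_host(const char *host)
{
	size_t i;

	if (!host)
		return NULL;
	for (i = 0; i < HOST_DAX_MAX; i++) {
		if (host_dax_registry[i] &&
		    strcmp(host_dax_registry[i]->host, host) == 0)
			return host_dax_registry[i];
	}
	return NULL;
}
```

    A filesystem that only holds a block_device would pass that device's
    name as the key, which is the stop-gap role the 'host' property plays
    in the commit above.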

    Signed-off-by: Dan Williams

    Dan Williams