26 Apr, 2018

1 commit

  • [ Upstream commit ee190ca6516bc8257e3d36187ca6f0f71a9ec477 ]

    follow_pte_pmd() can theoretically return after having acquired a PMD
    lock, even when DAX was not compiled with CONFIG_FS_DAX_PMD.

    Release the PMD lock unconditionally.
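
    The shape of the fix, sketched from the changelog (surrounding
    dax_mapping_entry_mkclean() details elided; not a verbatim diff):

    if (pmdp) {
    #ifdef CONFIG_FS_DAX_PMD
            /* ... write-protect and flush the PMD ... */
    unlock_pmd:
    #endif
            spin_unlock(ptl);       /* now reached even without CONFIG_FS_DAX_PMD */
    }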

    Link: http://lkml.kernel.org/r/20180118133839.20587-1-jschoenh@amazon.de
    Fixes: f729c8c9b24f ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean")
    Signed-off-by: Jan H. Schönherr
    Reviewed-by: Ross Zwisler
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

30 Nov, 2017

1 commit

  • commit 957ac8c421ad8b5eef9b17fe98e146d8311a541e upstream.

    PMD faults on a zero length file on a file system mounted with -o dax
    will not generate SIGBUS as expected.

    fd = open(...O_TRUNC);
    addr = mmap(NULL, 2*1024*1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    *addr = 'a';

    The problem is this code in dax_iomap_pmd_fault:

    max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;

    If the inode size is zero, we end up with a max_pgoff that is way larger
    than 0. :) Fix it by using DIV_ROUND_UP, as is done elsewhere in the
    kernel.
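
    (With i_size == 0 the "- 1" evaluates to -1, which stored in an
    unsigned max_pgoff is enormous.) The replacement expression looks
    roughly like:

    max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);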

    I tested this with some simple test code that ensured that SIGBUS was
    received where expected.
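
    The changelog doesn't include that test; here is a minimal
    reconstruction (the path /mnt/dax/file and the use of signal() are my
    assumptions, not the author's code):

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void handler(int sig)
    {
            static const char msg[] = "got SIGBUS, as expected\n";

            write(STDOUT_FILENO, msg, sizeof(msg) - 1); /* async-signal-safe */
            _exit(0);
    }

    int main(void)
    {
            /* hypothetical path; needs a file system mounted with -o dax */
            int fd = open("/mnt/dax/file", O_CREAT | O_RDWR | O_TRUNC, 0644);
            char *addr;

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            signal(SIGBUS, handler);
            addr = mmap(NULL, 2*1024*1024, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            *addr = 'a';    /* store past i_size == 0: SIGBUS expected */
            printf("BUG: store to a hole past EOF succeeded\n");
            return 1;
    }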

    Fixes: 642261ac995e ("dax: add struct iomap based DAX PMD support")
    Signed-off-by: Jeff Moyer
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     

15 Sep, 2017

1 commit

  • Merge tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - Some request-based DM core and DM multipath fixes and cleanups

    - Constify a few variables in DM core and DM integrity

    - Add bufio optimization and checksum failure accounting to DM
    integrity

    - Fix DM integrity to avoid checking integrity of failed reads

    - Fix DM integrity to use init_completion

    - A couple DM log-writes target fixes

    - Simplify DAX flushing by eliminating the unnecessary flush
    abstraction that was stood up for DM's use.

    * tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dax: remove the pmem_dax_ops->flush abstraction
    dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
    dm integrity: make blk_integrity_profile structure const
    dm integrity: do not check integrity for failed read operations
    dm log writes: fix >512b sectorsize support
    dm log writes: don't use all the cpu while waiting to log blocks
    dm ioctl: constify ioctl lookup table
    dm: constify argument arrays
    dm integrity: count and display checksum failures
    dm integrity: optimize writing dm-bufio buffers that are partially changed
    dm rq: do not update rq partially in each ending bio
    dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
    dm mpath: complain about unsupported __multipath_map_bio() return values
    dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through

    Linus Torvalds
     

11 Sep, 2017

1 commit

  • Commit abebfbe2f731 ("dm: add ->flush() dax operation support") is
    buggy. A DM device may be composed of multiple underlying devices and
    all of them need to be flushed. That commit just routes the flush
    request to the first device and ignores the other devices.

    It could be fixed by adding more complex logic to the device mapper. But
    there is only one implementation of the method pmem_dax_ops->flush - that
    is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
    don't need the pmem_dax_ops->flush abstraction at all, we can call
    arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
    can't ever reach anything different from arch_wb_cache_pmem().

    It should also be pointed out that some uses of persistent memory need
    to flush only a very small amount of data (such as one cacheline), and
    going through the device mapper machinery for a single flushed cache
    line would be overkill.

    Fix this by removing the pmem_dax_ops->flush abstraction and call
    arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
    mapper code that forwards the flushes.
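
    What dax_flush() reduces to after the change, sketched (helper names
    as I read the 4.14-era code; simplified):

    void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
    {
            if (dax_write_cache_enabled(dax_dev))
                    arch_wb_cache_pmem(addr, size);
    }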

    Fixes: abebfbe2f731 ("dm: add ->flush() dax operation support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Reviewed-by: Dan Williams
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     

07 Sep, 2017

7 commits

  • dax_pmd_insert_mapping() contains the following code:

    pfn_t pfn;
    if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
            goto fallback;
    /* ... */
    fallback:
    trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);

    When the condition in the if statement fails, the function calls
    trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.

    This issue has been found while building the kernel with clang. The
    compiler reported:

    fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
    if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fs/dax.c:1310:60: note: uninitialized use occurs here
    trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
    ^~~
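
    A minimal sketch of the fix idea (initialize pfn so the fallback
    tracepoint never reads an indeterminate value; the actual patch may be
    structured differently):

    pfn_t pfn = { .val = 0 };
    if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
            goto fallback;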

    Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
    Signed-off-by: Nicolas Iooss
    Reviewed-by: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     
  • Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
    equivalent page offset mask.
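
    Roughly (assuming the mask is applied when the entry is a PMD entry):

    if (dax_is_pmd_entry(entry))
            index &= ~PG_PMD_COLOUR;   /* round down to the PMD-aligned index */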

    Link: http://lkml.kernel.org/r/20170822222436.18926-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: "Slusarz, Marcin"
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add a comment explaining how the user addresses provided to read(2) and
    write(2) are validated in the DAX I/O path.

    We call dax_copy_from_iter() or copy_to_iter() on these without calling
    access_ok() first in the DAX code, and there was a concern that the user
    might be able to read/write to arbitrary kernel addresses with this
    path.

    Link: http://lkml.kernel.org/r/20170816173615.10098-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Now that we no longer insert struct page pointers in DAX radix trees the
    page cache code no longer needs to know anything about DAX exceptional
    entries. Move all the DAX exceptional entry definitions from dax.h to
    fs/dax.c.

    Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Now that we no longer insert struct page pointers in DAX radix trees we
    can remove the special casing for DAX in page_cache_tree_insert().

    This also allows us to make dax_wake_mapping_entry_waiter() local to
    fs/dax.c, removing it from dax.h.

    Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • When servicing mmap() reads from file holes the current DAX code
    allocates a page cache page of all zeroes and places the struct page
    pointer in the mapping->page_tree radix tree.

    This has three major drawbacks:

    1) It consumes memory unnecessarily. For every 4k page that is read via
    a DAX mmap() over a hole, we allocate a new page cache page. This
    means that if you read 1GiB worth of pages, you end up using 1GiB of
    zeroed memory. This is easily visible by looking at the overall
    memory consumption of the system or by looking at /proc/[pid]/smaps:

    7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
    Size: 1048576 kB
    Rss: 1048576 kB
    Pss: 1048576 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 1048576 kB
    Private_Dirty: 0 kB
    Referenced: 1048576 kB
    Anonymous: 0 kB
    LazyFree: 0 kB
    AnonHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 0 kB
    SwapPss: 0 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Locked: 0 kB

    2) It is slower than using a common zero page because each page fault
    has more work to do. Instead of just inserting a common zero page we
    have to allocate a page cache page, zero it, and then insert it. Here
    are the average latencies of dax_load_hole() as measured by ftrace on
    a random test box:

    Old method, using zeroed page cache pages: 3.4 us
    New method, using the common 4k zero page: 0.8 us

    This was the average latency over 1 GiB of sequential reads done by
    this simple fio script:

    [global]
    size=1G
    filename=/root/dax/data
    fallocate=none
    [io]
    rw=read
    ioengine=mmap

    3) The fact that we had to check for both DAX exceptional entries and
    for page cache pages in the radix tree made the DAX code more
    complex.

    Solve these issues by following the lead of the DAX PMD code and using a
    common 4k zero page instead. As with the PMD code we will now insert a
    DAX exceptional entry into the radix tree instead of a struct page
    pointer which allows us to remove all the special casing in the DAX
    code.

    Note that we do still pretty aggressively check for regular pages in the
    DAX radix tree, especially where we take action based on the bits set in
    the page. If we ever find a regular page in our radix tree now that
    most likely means that someone besides DAX is inserting pages (which has
    happened lots of times in the past), and we want to find that out early
    and fail loudly.

    This solution also removes the extra memory consumption. Here is that
    same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
    code:

    7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
    Size: 1048576 kB
    Rss: 0 kB
    Pss: 0 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 0 kB
    Private_Dirty: 0 kB
    Referenced: 0 kB
    Anonymous: 0 kB
    LazyFree: 0 kB
    AnonHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 0 kB
    SwapPss: 0 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Locked: 0 kB

    Overall system memory consumption is similarly improved.

    Another major change is that we remove dax_pfn_mkwrite() from our fault
    flow, and instead rely on the page fault itself to make the PTE dirty
    and writeable. The following description from the patch adding the
    vm_insert_mixed_mkwrite() call explains this a little more:

    "To be able to use the common 4k zero page in DAX we need to have our
    PTE fault path look more like our PMD fault path where a PTE entry
    can be marked as dirty and writeable as it is first inserted rather
    than waiting for a follow-up dax_pfn_mkwrite() =>
    finish_mkwrite_fault() call.

    Right now we can rely on having a dax_pfn_mkwrite() call because we
    can distinguish between these two cases in do_wp_page():

    case 1: 4k zero page => writable DAX storage
    case 2: read-only DAX storage => writeable DAX storage

    This distinction is made via vm_normal_page(). vm_normal_page()
    returns false for the common 4k zero page, though, just as it does
    for DAX ptes. Instead of special casing the DAX + 4k zero page case
    we will simplify our DAX PTE page fault sequence so that it matches
    our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
    We will instead use dax_iomap_fault() to handle write-protection
    faults.

    This means that insert_pfn() needs to follow the lead of
    insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
    'mkwrite' is set insert_pfn() will do the work that was previously
    done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
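
    A hedged sketch of the resulting insertion in the DAX PTE fault path
    (argument plumbing simplified):

    if (vmf->flags & FAULT_FLAG_WRITE)
            error = vm_insert_mixed_mkwrite(vma, vaddr, pfn); /* dirty + writeable */
    else
            error = vm_insert_mixed(vma, vaddr, pfn);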

    Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • dax_load_hole() will soon need to call dax_insert_mapping_entry(), so it
    needs to be moved lower in dax.c so the definition exists.

    dax_wake_mapping_entry_waiter() will soon be removed from dax.h and be
    made static to dax.c, so we need to move its definition above all its
    callers.

    Link: http://lkml.kernel.org/r/20170724170616.25810-3-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

01 Sep, 2017

1 commit

  • Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
    and make sure it is bracketed by calls to *_invalidate_range_start()/end().

    Note that because we cannot presume the PMD or PTE value, we have to
    assume the worst and unconditionally report an invalidation as
    happening.
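
    The conversion pattern, sketched with the 4.13-era signatures:

    mmu_notifier_invalidate_range_start(mm, start, end);
    /* ... clear or write-protect the PTE/PMD ... */
    mmu_notifier_invalidate_range_end(mm, start, end);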

    Signed-off-by: Jérôme Glisse
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Bernhard Held
    Cc: Adam Borowski
    Cc: Andrea Arcangeli
    Cc: Radim Krčmář
    Cc: Wanpeng Li
    Cc: Paolo Bonzini
    Cc: Takashi Iwai
    Cc: Nadav Amit
    Cc: Mike Galbraith
    Cc: Kirill A. Shutemov
    Cc: axie
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

26 Aug, 2017

1 commit

  • In DAX there are two separate places where the 2MiB range of a PMD is
    defined.

    The first is in the page tables, where a PMD mapping inserted for a
    given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
    PMD_MASK) + PMD_SIZE - 1). That is, from the 2MiB boundary below the
    address to the 2MiB boundary above the address.

    So, for example, a fault at address 3MiB (0x30 0000) falls within the
    PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).

    The second PMD range is in the mapping->page_tree, where a given file
    offset is covered by a radix tree entry that spans from one 2MiB aligned
    file offset to another 2MiB aligned file offset.

    So, for example, the file offset for 3MiB (pgoff 768) falls within the
    PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
    512) to 4MiB (pgoff 1024).

    This system works so long as the addresses and file offsets for a given
    mapping both have the same offsets relative to the start of each PMD.

    Consider the case where the starting address for a given file isn't 2MiB
    aligned - say our faulting address is 3 MiB (0x30 0000), but that
    corresponds to the beginning of our file (pgoff 0). Now all the PMDs in
    the mapping are misaligned so that the 2MiB range defined in the page
    tables never matches up with the 2MiB range defined in the radix tree.

    The current code notices this case for DAX faults to storage with the
    following test in dax_pmd_insert_mapping():

    if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
            goto unlock_fallback;

    This test makes sure that the pfn we get from the driver is 2MiB
    aligned, and relies on the assumption that the 2MiB alignment of the pfn
    we get back from the driver matches the 2MiB alignment of the faulting
    address.

    However, faults to holes were not checked and we could hit the problem
    described above.

    This was reported in response to the NVML nvml/src/test/pmempool_sync
    TEST5:

    $ cd nvml/src/test/pmempool_sync
    $ make TEST5

    You can grab NVML here:

    https://github.com/pmem/nvml/

    The dmesg warning you see when you hit this error is:

    WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310

    Where we notice in dax_insert_mapping_entry() that the radix tree entry
    we are about to replace doesn't match the locked entry that we had
    previously inserted into the tree. This happens because the initial
    insertion was done in grab_mapping_entry() using a pgoff calculated from
    the faulting address (vmf->address), and the replacement in
    dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
    vmf->pgoff.

    In our failure case those two page offsets (one calculated from
    vmf->address, one using vmf->pgoff) point to different order 9 radix
    tree entries.

    This failure case can result in a deadlock because the radix tree unlock
    also happens on the pgoff calculated from vmf->address. This means that
    the locked radix tree entry that we swapped in to the tree in
    dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
    future faults to that 2MiB range will block forever.

    Fix this by validating that the faulting address's PMD offset matches
    the PMD offset from the start of the file. This check is done at the
    very beginning of the fault and covers faults that would have mapped to
    storage as well as faults to holes. I left the COLOUR check in
    dax_pmd_insert_mapping() in place in case we ever hit the insanity
    condition where the alignment of the pfn we get from the driver doesn't
    match the alignment of the userspace address.
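
    A sketch of the added validation, per my reading of the changelog
    (exact placement in dax_iomap_pmd_fault() may differ):

    if ((vmf->pgoff & PG_PMD_COLOUR) !=
        ((vmf->address >> PAGE_SHIFT) & PG_PMD_COLOUR))
            goto fallback;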

    Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: "Slusarz, Marcin"
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

08 Jul, 2017

2 commits

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
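
    For orientation, a hedged sketch of the errseq_t pattern the series
    introduces (helper names per lib/errseq.c; per-file plumbing
    simplified):

    /* a writeback failure records the error in the mapping: */
    errseq_set(&mapping->wb_err, -EIO);

    /* each struct file samples the sequence, e.g. at open time: */
    file->f_wb_err = errseq_sample(&mapping->wb_err);

    /* fsync reports an error recorded since this file last checked: */
    err = errseq_check_and_advance(&mapping->wb_err, &file->f_wb_err);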

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:
    "libnvdimm updates for the latest ACPI and UEFI specifications. This
    pull request also includes new 'struct dax_operations' enabling to
    undo the abuse of copy_user_nocache() for copy operations to pmem.

    The dax work originally missed 4.12 to address concerns raised by Al.

    Summary:

    - Introduce the _flushcache() family of memory copy helpers and use
    them for persistent memory write operations on x86. The
    _flushcache() semantic indicates that the cache is either bypassed
    for the copy operation (movnt) or any lines dirtied by the copy
    operation are written back (clwb, clflushopt, or clflush).

    - Extend dax_operations with ->copy_from_iter() and ->flush()
    operations. These operations and other infrastructure updates allow
    all persistent memory specific dax functionality to be pushed into
    libnvdimm and the pmem driver directly. It also allows dax-specific
    sysfs attributes to be linked to a host device, for example:
    /sys/block/pmem0/dax/write_cache

    - Add support for the new NVDIMM platform/firmware mechanisms
    introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
    namespace label format, extensions to the address-range-scrub
    command set, new error injection commands, and a new BTT
    (block-translation-table) layout. These updates support inter-OS
    and pre-OS compatibility.

    - Fix a longstanding memory corruption bug in nfit_test.

    - Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
    capable.

    - Miscellaneous fixes and small updates across libnvdimm and the nfit
    driver.

    Acknowledgements that came after the branch was pushed: commit
    6aa734a2f38e ("libnvdimm, region, pmem: fix 'badblocks'
    sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
    "

    * tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
    libnvdimm, namespace: record 'lbasize' for pmem namespaces
    acpi/nfit: Issue Start ARS to retrieve existing records
    libnvdimm: New ACPI 6.2 DSM functions
    acpi, nfit: Show bus_dsm_mask in sysfs
    libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
    acpi, nfit: Enable DSM pass thru for root functions.
    libnvdimm: passthru functions clear to send
    libnvdimm, btt: convert some info messages to warn/err
    libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
    libnvdimm: fix the clear-error check in nsio_rw_bytes
    libnvdimm, btt: fix btt_rw_page not returning errors
    acpi, nfit: quiet invalid block-aperture-region warnings
    libnvdimm, btt: BTT updates for UEFI 2.7 format
    acpi, nfit: constify *_attribute_group
    libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
    libnvdimm, pmem, dax: export a cache control attribute
    dax: convert to bitmask for flags
    dax: remove default copy_from_iter fallback
    libnvdimm, nfit: enable support for volatile ranges
    libnvdimm, pmem: fix persistence warning
    ...

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.
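
    For example (cgroup path illustrative):

    $ grep -E 'pgrefill|pgscan|pgsteal|pgactivate|pgdeactivate|pglazyfree' \
        /sys/fs/cgroup/mygroup/memory.stat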

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

06 Jul, 2017

1 commit

  • Jan Kara's description for this patch is much better than mine, so I'm
    quoting it verbatim here:

    DAX currently doesn't set errors in the mapping when cache flushing
    fails in dax_writeback_mapping_range(). Since this function can get
    called only from fsync(2) or sync(2), this is actually as good as it can
    currently get since we correctly propagate the error up from
    dax_writeback_mapping_range() to filemap_fdatawrite()

    However, in the future better writeback error handling will enable us to
    properly report these errors on fsync(2) even if there are multiple file
    descriptors open against the file or if sync(2) gets called before
    fsync(2). So convert DAX to using standard error reporting through the
    mapping.
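
    The conversion amounts to recording the error in the mapping on the
    writeback failure path, roughly:

    if (ret < 0)
            mapping_set_error(mapping, ret);   /* seen by a later fsync(2) */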

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Reviewed-and-tested-by: Ross Zwisler

    Jeff Layton
     

28 Jun, 2017

1 commit

  • Now that all callers of the pmem api have been converted to dax helpers that
    call back to the pmem driver, we can remove include/linux/pmem.h and
    asm/pmem.h.

    Cc:
    Cc: Jeff Moyer
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: Toshi Kani
    Cc: Oliver O'Halloran
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     


20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.
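
    The mechanical shape of the rename at use sites:

    wait_queue_t wait;              /* old */
    wait_queue_entry_t wait;        /* new */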

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 Jun, 2017

3 commits

  • The clear_pmem() helper simply combines a memset() plus a cache flush.
    Now that the flush routine is optionally provided by the dax device
    driver we can avoid unnecessary cache management on dax devices fronting
    volatile memory.

    With clear_pmem() gone we can follow on with a patch to make pmem cache
    management completely defined within the pmem driver.
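
    What remains at former clear_pmem() call sites, sketched (4.13-era
    dax_flush() signature):

    memset(vaddr, 0, size);
    dax_flush(dax_dev, pgoff, vaddr, size);   /* nop without a flush method */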

    Cc:
    Cc: Jeff Moyer
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Filesystem-DAX flushes caches whenever it writes to the address returned
    through dax_direct_access() and when writing back dirty radix entries.
    That flushing is only required in the pmem case, so the dax_flush()
    helper skips cache management work when the underlying driver does not
    specify a flush method.

    We still do all the dirty tracking since the radix entry will already be
    there for locking purposes. However, the work to clean the entry will be
    a nop for some dax drivers.
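
    The skip, sketched (the real function also checks that the device is
    alive and that its write cache is enabled):

    if (dax_dev->ops->flush)
            dax_dev->ops->flush(dax_dev, pgoff, addr, size);
    /* no flush method: backing store is volatile, nothing to do */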

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     
    Now that all possible providers of the dax_operations copy_from_iter
    method are implemented, switch filesystem-dax to call the driver rather
    than copy_from_iter_pmem.
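
    The call site in the dax I/O path then looks roughly like:

    map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr, map_len, iter);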

    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     

03 Jun, 2017

1 commit

  • We currently have two related PMD vs PTE races in the DAX code. These
    can both be easily triggered by having two threads reading and writing
    simultaneously to the same private mapping, with the key being that
    private mapping reads can be handled with PMDs but private mapping
    writes are always handled with PTEs so that we can COW.

    Here is the first race:

    CPU 0                                   CPU 1

    (private mapping write)
    __handle_mm_fault()
      create_huge_pmd() - FALLBACK
      handle_pte_fault()
        passes check for pmd_devmap()

                                            (private mapping read)
                                            __handle_mm_fault()
                                              create_huge_pmd()
                                                dax_iomap_pmd_fault() inserts PMD

    dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
    installed in our page tables at this spot.

    Here's the second race:

    CPU 0                                   CPU 1

    (private mapping read)
    __handle_mm_fault()
      passes check for pmd_none()
      create_huge_pmd()
        dax_iomap_pmd_fault() inserts PMD

    (private mapping write)
    __handle_mm_fault()
      create_huge_pmd() - FALLBACK
                                            (private mapping read)
                                            __handle_mm_fault()
                                              passes check for pmd_none()
                                              create_huge_pmd()

      handle_pte_fault()
        dax_iomap_pte_fault() inserts PTE
                                            dax_iomap_pmd_fault() inserts PMD,
                                            but we already have a PTE at
                                            this spot.

    The core of the issue is that while there is isolation between faults to
    the same range in the DAX fault handlers via our DAX entry locking,
    there is no isolation between faults in the code in mm/memory.c. This
    means for instance that this code in __handle_mm_fault() can run:

    if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
            ret = create_huge_pmd(&vmf);

    But by the time we actually get to run the fault handler called by
    create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
    fault has installed a normal PMD here as a parent. This is the cause of
    the 2nd race. The first race is similar - there is the following check
    in handle_pte_fault():

    } else {
            /* See comment in pte_alloc_one_map() */
            if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
                    return 0;

    So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
    will bail and retry the fault. This is correct, but there is nothing
    preventing the PMD from being installed after this check but before we
    actually get to the DAX PTE fault handlers.

    In my testing these races result in the following types of errors:

    BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
    BUG: non-zero nr_ptes on freeing mm: 15

    Fix this issue by having the DAX fault handlers verify that it is safe
    to continue their fault after they have taken an entry lock to block
    other racing faults.
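
    A sketch of the added verification on the PTE side (the PMD handler
    gains a symmetric check; names simplified):

    /* after taking the DAX entry lock in the PTE fault handler */
    if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
            ret = VM_FAULT_NOPAGE;  /* back out; the fault will be retried */
            goto unlock_entry;
    }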

    [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
    Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
    Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Pawel Lebioda
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: "Kirill A . Shutemov"
    Cc: Pawel Lebioda
    Cc: Dave Jiang
    Cc: Xiong Zhou
    Cc: Eryu Guan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

14 May, 2017

1 commit

  • Merge misc fixes from Andrew Morton:
    "15 fixes"

    * emailed patches from Andrew Morton:
    mm, docs: update memory.stat description with workingset* entries
    mm: vmscan: scan until it finds eligible pages
    mm, thp: copying user pages must schedule on collapse
    dax: fix PMD data corruption when fault races with write
    dax: fix data corruption when fault races with write
    ext4: return to starting transaction in ext4_dax_huge_fault()
    mm: fix data corruption due to stale mmap reads
    dax: prevent invalidation of mapped DAX entries
    Tigran has moved
    mm, vmalloc: fix vmalloc users tracking properly
    mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
    gcov: support GCC 7.1
    mm, vmstat: Remove spurious WARN() during zoneinfo print
    time: delete current_fs_time()
    hwpoison, memcg: forcibly uncharge LRU pages

    Linus Torvalds
     

13 May, 2017

5 commits

  • This is based on a patch from Jan Kara that fixed the equivalent race in
    the DAX PTE fault path.

    Currently DAX PMD read fault can race with write(2) in the following
    way:

    CPU1 - write(2)                         CPU2 - read fault
                                            dax_iomap_pmd_fault()
                                              ->iomap_begin() - sees hole
    dax_iomap_rw()
      iomap_apply()
        ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
                                            grab_mapping_entry()
                                              - we add huge zero page to the radix tree
                                                and map it to page tables

    The result is that hole page is mapped into page tables (and thus zeros
    are seen in mmap) while file has data written in that place.

    Fix the problem by locking exception entry before mapping blocks for the
    fault. That way we are sure invalidate_inode_pages2_range() call for
    racing write will either block on entry lock waiting for the fault to
    finish (and unmap stale page tables after that) or read fault will see
    already allocated blocks by write(2).
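
    The resulting order in the fault handler, sketched (names abbreviated):

    entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD); /* lock entry */
    /* only then: */
    error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);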

    Fixes: 9f141d6ef6258 ("dax: Call ->iomap_begin without entry lock during dax fault")
    Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Currently DAX read fault can race with write(2) in the following way:

    CPU1 - write(2)                         CPU2 - read fault
                                            dax_iomap_pte_fault()
                                              ->iomap_begin() - sees hole
    dax_iomap_rw()
      iomap_apply()
        ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
                                            grab_mapping_entry()
                                              - we add zero page in the radix tree
                                                and map it to page tables

    The result is that hole page is mapped into page tables (and thus zeros
    are seen in mmap) while file has data written in that place.

    Fix the problem by locking exception entry before mapping blocks for the
    fault. That way we are sure invalidate_inode_pages2_range() call for
    racing write will either block on entry lock waiting for the fault to
    finish (and unmap stale page tables after that) or read fault will see
    already allocated blocks by write(2).

    Fixes: 9f141d6ef6258 ("dax: Call ->iomap_begin without entry lock during dax fault")
    Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    Currently we don't invalidate page tables during invalidate_inode_pages2()
    for DAX. That can result in e.g. a 2MiB zero page being mapped into
    page tables while there are already underlying blocks allocated, and
    thus data seen through mmap differs from data seen by read(2).
    The following sequence reproduces the problem:

    - open an mmap over a 2MiB hole

    - read from a 2MiB hole, faulting in a 2MiB zero page

    - write to the hole with write(3p). The write succeeds but we
    incorrectly leave the 2MiB zero page mapping intact.

    - via the mmap, read the data that was just written. Since the zero
    page mapping is still intact we read back zeroes instead of the new
    data.

    Fix the problem by unconditionally calling invalidate_inode_pages2_range()
    in dax_iomap_actor() for new block allocations and by properly
    invalidating page tables in invalidate_inode_pages2_range() for DAX
    mappings.
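
    The dax_iomap_actor() half of the fix, sketched (IOMAP_F_NEW marks a
    fresh allocation):

    if (iomap->flags & IOMAP_F_NEW)
            invalidate_inode_pages2_range(inode->i_mapping,
                            pos >> PAGE_SHIFT, (end - 1) >> PAGE_SHIFT);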

    Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
    Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
    v4.

    This series fixes data corruption that can happen for DAX mounts when
    page faults race with write(2) and as a result page tables get out of
    sync with block mappings in the filesystem and thus data seen through
    mmap is different from data seen through read(2).

    The series passes testing with t_mmap_stale test program from Ross and
    also other mmap related tests on DAX filesystem.

    This patch (of 4):

    dax_invalidate_mapping_entry() currently removes DAX exceptional entries
    only if they are clean and unlocked. This is done via:

    invalidate_mapping_pages()
    invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

    However, for page cache pages removed in invalidate_mapping_pages()
    there is an additional criteria which is that the page must not be
    mapped. This is noted in the comments above invalidate_mapping_pages()
    and is checked in invalidate_inode_page().

    For DAX entries this means that we can end up in a situation where a
    DAX exceptional entry, either a huge zero page or a regular DAX entry,
    could end up mapped but without an associated radix tree entry. This is
    inconsistent with the rest of the DAX code and with what happens in the
    page cache case.

    We aren't able to unmap the DAX exceptional entry because according to
    its comments invalidate_mapping_pages() isn't allowed to block, and
    unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

    Since we essentially never have unmapped DAX entries to evict from the
    radix tree, just remove dax_invalidate_mapping_entry().

    Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
    Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Reported-by: Jan Kara
    Cc: Dan Williams
    Cc: [4.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

11 May, 2017

1 commit

  • The conversion of __dax_zero_page_range() to 'struct dax_operations'
    caused it to frequently fail. The mistake was treating the @size
    parameter as a dax mapping length rather than just a length of the
    clear_pmem() operation. The dax mapping length is assumed to be hard
    coded as PAGE_SIZE.

    Without this fix any page unaligned zeroing request will trigger a
    -EINVAL return from bdev_dax_pgoff().
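
    The essence of the fix, per the changelog (map a single page; zero only
    @size bytes within it):

    rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);  /* was: size */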

    Cc: Jan Kara
    Cc: Christoph Hellwig
    Reported-by: Ross Zwisler
    Tested-by: Ross Zwisler
    Fixes: cccbce671582 ("filesystem-dax: convert to dax_direct_access()")
    Signed-off-by: Dan Williams

    Dan Williams
     

09 May, 2017

6 commits

  • Add a tracepoint to dax_insert_mapping(), following the same logging
    conventions as the rest of DAX. This tracepoint, along with the one in
    dax_load_hole(), lets us know how a DAX PTE fault was serviced.

    Here is an example DAX fault that inserts a PTE mapping:

    small-1126 [007] .... 145.451604: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220

    small-1126 [007] .... 145.452317: dax_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10420000 radix_entry 0x100006

    small-1126 [007] .... 145.452399: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
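
    These tracepoints live under the fs_dax event group and can be enabled
    with, e.g.:

    # echo 1 > /sys/kernel/debug/tracing/events/fs_dax/dax_insert_mapping/enable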

    Link: http://lkml.kernel.org/r/20170221195116.13278-7-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add a tracepoint to dax_writeback_one(), following the same logging
    conventions as the rest of DAX.

    Here is an example range writeback which ends up flushing one PMD and
    one PTE:

    test-1265 [003] .... 496.615250: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff

    test-1265 [003] .... 496.616263: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x0 pglen 0x200

    test-1265 [003] .... 496.616270: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x305 pglen 0x1

    test-1265 [003] .... 496.616272: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff

    [akpm@linux-foundation.org: struct blk_dax_ctl has disappeared]
    Link: http://lkml.kernel.org/r/20170221195116.13278-6-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add tracepoints to dax_writeback_mapping_range(), following the same
    logging conventions as the rest of DAX.

    Here is an example writeback call:

    msync-1085 [006] .... 200.902565: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff

    msync-1085 [006] .... 200.902579: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff

    [ross.zwisler@linux.intel.com: fix regression in dax_writeback_mapping_range()]
    Link: http://lkml.kernel.org/r/20170314215358.31451-1-ross.zwisler@linux.intel.com
    Link: http://lkml.kernel.org/r/20170221195116.13278-5-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add tracepoints to dax_load_hole(), following the same logging conventions
    as the rest of DAX.

    Here is the logging generated by a PTE read from a hole:

    read-1075 [002] .... 62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280

    read-1075 [002] .... 62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE

    read-1075 [002] .... 62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE

    Link: http://lkml.kernel.org/r/20170221195116.13278-4-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add tracepoints to dax_pfn_mkwrite(), following the same logging
    conventions as the rest of DAX.

    Here is an example PTE fault followed by a pfn_mkwrite:

    small_aligned-1094 [002] .... 374.084998: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200

    small_aligned-1094 [002] .... 374.085145: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 MAJOR|NOPAGE

    small_aligned-1094 [002] .... 374.085165: dax_pfn_mkwrite: dev 259:0 ino 0x1003 shared WRITE|MKWRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 NOPAGE

    Link: http://lkml.kernel.org/r/20170221195116.13278-3-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Patch series "second round of tracepoints for DAX".

    This second round of DAX tracepoint patches adds tracing to the PTE
    fault path (dax_iomap_pte_fault(), dax_pfn_mkwrite(), dax_load_hole(),
    dax_insert_mapping()) and to the writeback path
    (dax_writeback_mapping_range(), dax_writeback_one()).

    The purpose of this tracing is to give us a high level view of what DAX
    is doing, whether faults are being serviced by PMDs or PTEs, and by real
    storage or by zero pages covering holes.

    I do have some patches nearly ready which also add tracing to
    grab_mapping_entry() and dax_insert_mapping_entry(). These are more
    targeted at logging how we are interacting with the radix tree, how we
    use empty entries for locking, whether we "downgrade" huge zero pages to
    4k PTE sized allocations, etc. In the end it seemed to me that this
    might be too detailed to have as constantly present tracepoints, but if
    anyone sees value in having tracepoints like this in the DAX code
    permanently (Jan?), please let me know and I'll add those last two
    patches.

    All these tracepoints were done to be consistent with the style of the
    XFS tracepoints and with the existing DAX PMD tracepoints.

    This patch (of 6):

    Add tracepoints to dax_iomap_pte_fault(), following the same logging
    conventions as the rest of DAX.

    Here is an example fault that initially tries to be serviced by the PMD
    fault handler but which falls back to PTEs because the VMA isn't large
    enough to hold a PMD:

    small-1086 [005] .... 71.140014: xfs_filemap_huge_fault: dev 259:0 ino 0x1003

    small-1086 [005] .... 71.140027: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400

    small-1086 [005] .... 71.140028: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400 FALLBACK

    small-1086 [005] .... 71.140035: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220

    small-1086 [005] .... 71.140396: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE

    Link: http://lkml.kernel.org/r/20170221195116.13278-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

02 May, 2017

1 commit

  • Pull block layer updates from Jens Axboe:

    - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
    was initially a fork of CFQ, but subsequently changed to implement
    fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
    to be used on desktop type single drives, providing good fairness.
    From Paolo.

    - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
    using a scalable token based algorithm that throttles IO based on
    live completion IO stats, similarly to blk-wbt. From Omar.

    - A series from Jan, moving users to separately allocated backing
    devices. This continues the work of separating backing device life
    times, solving various problems with hot removal.

    - A series of updates for lightnvm, mostly from Javier. Includes a
    'pblk' target that exposes an open channel SSD as a physical block
    device.

    - A series of fixes and improvements for nbd from Josef.

    - A series from Omar, removing queue sharing between devices on mostly
    legacy drivers. This helps us clean up other bits, if we know that a
    queue only has a single device backing. This has been overdue for
    more than a decade.

    - Fixes for the blk-stats, and improvements to unify the stats and user
    windows. This both improves blk-wbt, and enables other users to
    register a need to receive IO stats for a device. From Omar.

    - blk-throttle improvements from Shaohua. This provides a scalable
    framework for implementing scalable prioritization - particularly for
    blk-mq, but applicable to any type of block device. The interface is
    marked experimental for now.

    - Bucketized IO stats for IO polling from Stephen Bates. This improves
    efficiency of polled workloads in the presence of mixed block size
    IO.

    - A few fixes for opal, from Scott.

    - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
    From a variety of folks, mostly Sagi and James Smart.

    - A series from Bart, improving our exposed info and capabilities from
    the blk-mq debugfs support.

    - A series from Christoph, cleaning up how handle WRITE_ZEROES.

    - A series from Christoph, cleaning up the block layer handling of how
    we track errors in a request. On top of being a nice cleanup, it also
    shrinks the size of struct request a bit.

    - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
    never used by platforms, and the latter has outlived its usefulness.

    - Various little bug fixes and cleanups from a wide variety of folks.

    * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
    block: hide badblocks attribute by default
    blk-mq: unify hctx delay_work and run_work
    block: add kblock_mod_delayed_work_on()
    blk-mq: unify hctx delayed_run_work and run_work
    nbd: fix use after free on module unload
    MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
    blk-mq-sched: alloate reserved tags out of normal pool
    mtip32xx: use runtime tag to initialize command header
    scsi: Implement blk_mq_ops.show_rq()
    blk-mq: Add blk_mq_ops.show_rq()
    blk-mq: Show operation, cmd_flags and rq_flags names
    blk-mq: Make blk_flags_show() callers append a newline character
    blk-mq: Move the "state" debugfs attribute one level down
    blk-mq: Unregister debugfs attributes earlier
    blk-mq: Only unregister hctxs for which registration succeeded
    blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
    blk-mq: Let blk_mq_debugfs_register() look up the queue name
    blk-mq: Register /queue/mq after having registered /queue
    ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
    ide-pm: always pass 0 error to __blk_end_request_all
    ...

    Linus Torvalds