19 Sep, 2016

2 commits

  • Very similar to the existing dax_fault function, but instead of using
    the get_block callback we rely on the iomap_ops vector from iomap.c.
    That also avoids having to do two calls into the file system for write
    faults.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dave Chinner

    Christoph Hellwig
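
    As a rough illustration of what supplying an iomap_ops vector means for a
    filesystem, here is a minimal sketch of an iomap_begin implementation wired
    into an ops table. The myfs_* names and the block-lookup helper are
    hypothetical; the five-argument iomap_begin signature and the blkno field
    match the iomap code of this era (later kernels pass an extra srcmap
    argument and store the address in iomap->addr).

        #include <linux/fs.h>
        #include <linux/iomap.h>

        /* Hypothetical filesystem helper: resolve [pos, pos + length) to a
         * contiguous extent starting at *sector (512-byte units). */
        static int myfs_lookup_extent(struct inode *inode, loff_t pos,
                                      loff_t length, sector_t *sector,
                                      loff_t *len);

        static int myfs_iomap_begin(struct inode *inode, loff_t pos,
                                    loff_t length, unsigned flags,
                                    struct iomap *iomap)
        {
                sector_t sector;
                loff_t len;

                /* A real filesystem would also allocate blocks here when
                 * flags contains IOMAP_WRITE, and report holes as
                 * IOMAP_HOLE for read faults. */
                if (myfs_lookup_extent(inode, pos, length, &sector, &len))
                        return -EIO;

                iomap->type   = IOMAP_MAPPED;
                iomap->offset = pos;
                iomap->length = len;
                iomap->blkno  = sector;  /* later kernels: iomap->addr, in bytes */
                iomap->bdev   = inode->i_sb->s_bdev;
                return 0;
        }

        static struct iomap_ops myfs_iomap_ops = {
                .iomap_begin = myfs_iomap_begin,
                /* .iomap_end is optional when there is nothing to undo */
        };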
     
  • This is a much simpler implementation of the DAX read/write path
    that makes use of the iomap infrastructure. It does not try to
    mirror the direct I/O calling conventions and thus doesn't have to
    deal with i_dio_count or the end_io handler, but instead leaves
    locking and filesystem-specific I/O completion to the caller.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dave Chinner

    Christoph Hellwig
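
    To illustrate "leaves locking and filesystem-specific I/O completion to the
    caller", here is a sketch of how a filesystem's ->read_iter might drive the
    iomap-based DAX I/O helper while holding its own locks. The helper is named
    dax_iomap_rw() in current kernels (the name used when this patch first
    landed differed slightly), and myfs_iomap_ops is the hypothetical ops table
    from the sketch above.

        static ssize_t myfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
        {
                struct inode *inode = file_inode(iocb->ki_filp);
                ssize_t ret;

                if (!IS_DAX(inode))
                        return generic_file_read_iter(iocb, to);

                /* The caller, not the DAX code, takes the inode lock... */
                inode_lock_shared(inode);
                ret = dax_iomap_rw(iocb, to, &myfs_iomap_ops);
                inode_unlock_shared(inode);

                /* ...and performs any filesystem-specific completion work. */
                file_accessed(iocb->ki_filp);
                return ret;
        }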
     

27 Jul, 2016

1 commit

  • Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this
    removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
    dax_pmd_fault() respectively, and update all callers.

    The dax_fault() and dax_pmd_fault() wrappers were initially intended to
    capture some filesystem independent functionality around page faults
    (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
    and ctime).

    However, the following commits:

    5726b27b09cc ("ext2: Add locking for DAX faults")
    ea3d7209ca01 ("ext4: fix races between page faults and hole punching")

    added locking to the ext2 and ext4 filesystems after these common
    operations but before __dax_fault() and __dax_pmd_fault() were called.
    This means that these wrappers are no longer used, and are unlikely to
    be used in the future.

    XFS has had locking analogous to what was recently added to ext2 and
    ext4 since DAX support was initially introduced by:

    6b698edeeef0 ("xfs: add DAX file operations support")

    Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dan Williams
    Cc: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
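
    For context, the removed wrappers were thin; reconstructed from the
    description above (a sketch, not a quote from the tree), the dropped
    dax_fault() wrapper amounted to roughly:

        static int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
                             get_block_t get_block)
        {
                struct super_block *sb = file_inode(vma->vm_file)->i_sb;
                int result;

                if (vmf->flags & FAULT_FLAG_WRITE) {
                        sb_start_pagefault(sb);          /* freeze protection */
                        file_update_time(vma->vm_file);  /* mtime/ctime update */
                }
                result = __dax_fault(vma, vmf, get_block);
                if (vmf->flags & FAULT_FLAG_WRITE)
                        sb_end_pagefault(sb);

                return result;
        }

    The ext2/ext4/XFS fault handlers now do this work (plus their own locking)
    themselves before calling the renamed dax_fault().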
     

27 May, 2016

2 commits

  • Pull DAX locking updates from Ross Zwisler:
    "Filesystem DAX locking for 4.7

    - We use a bit in an exceptional radix tree entry as a lock bit and
    use it similarly to how page lock is used for normal faults. This
    fixes races between hole instantiation and read faults of the same
    index.

    - Filesystem DAX PMD faults are disabled, and will be re-enabled when
    PMD locking is implemented"

    * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Remove i_mmap_lock protection
    dax: Use radix tree entry lock to protect cow faults
    dax: New fault locking
    dax: Allow DAX code to replace exceptional entries
    dax: Define DAX lock bit for radix tree exceptional entry
    dax: Make huge page handling depend of CONFIG_BROKEN
    dax: Fix condition for filling of PMD holes

    Linus Torvalds
     
  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
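
    The page-alignment requirement mentioned for DAX mounts comes down to
    simple arithmetic: the partition must begin (and, to map cleanly, end) on a
    page boundary of the underlying device. A self-contained sketch of that
    check, independent of the actual in-kernel helper:

        #include <stdbool.h>
        #include <stdio.h>

        #define SECTOR_SIZE 512ULL
        #define PAGE_SIZE   4096ULL   /* assumption: 4 KiB pages */

        static bool dax_partition_aligned(unsigned long long start_sector,
                                          unsigned long long nr_sectors)
        {
                unsigned long long start = start_sector * SECTOR_SIZE;
                unsigned long long size  = nr_sectors * SECTOR_SIZE;

                /* DAX maps whole pages straight to the device, so both the
                 * start offset and the size must be page multiples. */
                return (start % PAGE_SIZE == 0) && (size % PAGE_SIZE == 0);
        }

        int main(void)
        {
                /* Sector 63 (old DOS default) is not page aligned; 2048 is. */
                printf("start 63:   %s\n",
                       dax_partition_aligned(63, 1 << 20) ? "ok" : "unaligned");
                printf("start 2048: %s\n",
                       dax_partition_aligned(2048, 1 << 20) ? "ok" : "unaligned");
                return 0;
        }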
     

20 May, 2016

5 commits

    When doing cow faults, we cannot directly fill in the PTE as we do for
    other faults, because we rely on generic code to do proper accounting of
    the cowed page. We also have no page to lock to protect against races with
    truncate as other faults have, and we need the protection to extend until
    the moment generic code inserts the cowed page into the PTE; by that point
    the fs-specific i_mmap_sem no longer protects us. So far we relied on
    i_mmap_lock for this protection, but that is entirely special to cow
    faults. To make fault locking more uniform, use the DAX entry lock instead.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently DAX page fault locking is racy.

    CPU0 (write fault)                        CPU1 (read fault)

    __dax_fault()                             __dax_fault()
      get_block(inode, block, &bh, 0) -> not mapped
                                                get_block(inode, block, &bh, 0)
                                                  -> not mapped
      if (!buffer_mapped(&bh))
        if (vmf->flags & FAULT_FLAG_WRITE)
          get_block(inode, block, &bh, 1) -> allocates blocks
      if (page) -> no
                                                if (!buffer_mapped(&bh))
                                                  if (vmf->flags & FAULT_FLAG_WRITE) {
                                                  } else {
                                                    dax_load_hole();
                                                  }
      dax_insert_mapping()

    And we are in a situation where we fail in dax_radix_entry() with -EIO.

    Another problem with the current DAX page fault locking is that there is
    no race-free way to clear the dirty tag in the radix tree. We can always
    end up with a clean radix tree and dirty data in the CPU cache.

    We fix the first problem by introducing locking of exceptional radix
    tree entries in DAX mappings, acting very similarly to the page lock and
    thus properly synchronizing faults against the same mapping index. The
    same lock can later be used to avoid races when clearing the radix tree
    dirty tag.

    Reviewed-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
    Currently we forbid page_cache_tree_insert() to replace exceptional radix
    tree entries for DAX inodes. However, to make DAX faults race-free we will
    lock radix tree entries, and when a hole is created we need to replace
    such a locked radix tree entry with a hole page. So modify
    page_cache_tree_insert() to allow that.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
    We will use the lowest available bit in the radix tree exceptional entry
    for locking of the entry. Define it. Also clean up the definitions of DAX
    entry type bits in DAX exceptional entries to use defined constants
    instead of hardcoded numbers, and clean up the checking of these bits so
    that it does not rely on how other bits in the entry are set.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
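
    A self-contained illustration of the idea (the constants below are
    illustrative, not the kernel's RADIX_DAX_* values): the exceptional entry
    packs the sector into the upper bits of an unsigned long and reserves low
    bits for the exceptional marker, the entry type, and the new lock bit, so
    locking an entry is just setting one bit of the stored value.

        #include <assert.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative layout only. */
        #define ENTRY_EXCEPTIONAL  (1UL << 0)  /* marks a non-page entry */
        #define ENTRY_LOCK         (1UL << 1)  /* the per-entry lock bit */
        #define ENTRY_PMD          (1UL << 2)  /* entry type: PMD vs PTE */
        #define ENTRY_SHIFT        3           /* sector lives above the flags */

        static unsigned long mk_entry(unsigned long sector, bool pmd)
        {
                return (sector << ENTRY_SHIFT) |
                       (pmd ? ENTRY_PMD : 0) | ENTRY_EXCEPTIONAL;
        }

        static unsigned long entry_sector(unsigned long e) { return e >> ENTRY_SHIFT; }
        static bool entry_locked(unsigned long e)          { return e & ENTRY_LOCK; }
        static unsigned long entry_lock(unsigned long e)   { return e | ENTRY_LOCK; }
        static unsigned long entry_unlock(unsigned long e) { return e & ~ENTRY_LOCK; }

        int main(void)
        {
                unsigned long e = mk_entry(12345, false);

                assert(!entry_locked(e));
                e = entry_lock(e);     /* fault handler takes the entry lock */
                assert(entry_locked(e) && entry_sector(e) == 12345);
                e = entry_unlock(e);   /* and drops it when the fault is done */
                printf("sector %lu locked %d\n", entry_sector(e), entry_locked(e));
                return 0;
        }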
     
  • Currently the handling of huge pages for DAX is racy. For example the
    following can happen:

    CPU0 (THP write fault)                    CPU1 (normal read fault)

    __dax_pmd_fault()                         __dax_fault()
      get_block(inode, block, &bh, 0) -> not mapped
                                                get_block(inode, block, &bh, 0)
                                                  -> not mapped
      if (!buffer_mapped(&bh) && write)
        get_block(inode, block, &bh, 1) -> allocates blocks
      truncate_pagecache_range(inode, lstart, lend);
                                                dax_load_hole();

    This results in data corruption since the process on CPU1 won't see the
    changes CPU0 made to the file.

    The race can happen even when two normal faults race; with THP, however,
    the situation is even worse because the two faults don't operate on the
    same entries in the radix tree, and we want to use those entries for
    serialization. So make THP support in the DAX code depend on CONFIG_BROKEN
    for now.

    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     

17 May, 2016

1 commit

    Fault handlers currently take a complete_unwritten argument to convert
    unwritten extents after PTEs are updated. However, no filesystem uses this
    anymore as the code is racy. Remove the unused argument.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     

28 Feb, 2016

2 commits

  • Previously calls to dax_writeback_mapping_range() for all DAX filesystems
    (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

    dax_writeback_mapping_range() needs a struct block_device, and it used
    to get that from inode->i_sb->s_bdev. This is correct for normal inodes
    mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
    block devices and for XFS real-time files.

    Instead, call dax_writeback_mapping_range() directly from the filesystem
    ->writepages function so that it can supply us with a valid block
    device. This also fixes DAX code to properly flush caches in response
    to sync(2).

    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
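
    A sketch of the resulting hook, roughly the shape the ext2/ext4/XFS
    changes took (myfs_writepages is a stand-in for the real per-filesystem
    functions, and the dax_writeback_mapping_range() arguments shown are those
    of this era; later kernels changed them again):

        #include <linux/dax.h>
        #include <linux/fs.h>
        #include <linux/writeback.h>

        static int myfs_writepages(struct address_space *mapping,
                                   struct writeback_control *wbc)
        {
                if (dax_mapping(mapping)) {
                        /* ext2/ext4 can simply pass i_sb->s_bdev here; XFS
                         * instead looks up the bdev backing this inode so
                         * real-time files get the right device. */
                        return dax_writeback_mapping_range(mapping,
                                        mapping->host->i_sb->s_bdev, wbc);
                }
                return generic_writepages(mapping, wbc);
        }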
     
  • dax_clear_blocks() needs a valid struct block_device and previously it
    was using inode->i_sb->s_bdev in all cases. This is correct for normal
    inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
    DAX raw block devices and for XFS real-time devices.

    Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
    its arguments to take a bdev and a sector instead of an inode and a
    block. This better reflects what the function does, and it allows the
    filesystem and raw block device code to pass in an appropriate struct
    block_device.

    Signed-off-by: Ross Zwisler
    Suggested-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Theodore Ts'o
    Cc: Al Viro
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
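
    The calling-convention change in miniature (a sketch; the bdev is whatever
    device the filesystem knows actually backs this inode): the caller converts
    its filesystem block number to a 512-byte sector and names the device
    explicitly instead of handing over an inode.

        #include <linux/dax.h>
        #include <linux/fs.h>

        static int myfs_zero_extent(struct inode *inode, struct block_device *bdev,
                                    sector_t block, long size)
        {
                /* filesystem block number -> 512-byte device sector */
                sector_t sector = block << (inode->i_blkbits - 9);

                return dax_clear_sectors(bdev, sector, size);
        }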
     

31 Jan, 2016

1 commit

  • Avoid populating pagecache when the block device is in DAX mode.
    Otherwise these page cache entries collide with the fsync/msync
    implementation and break data durability guarantees.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Andrew Morton
    Reported-by: Ross Zwisler
    Tested-by: Ross Zwisler
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Dan Williams

    Dan Williams
     

23 Jan, 2016

2 commits

  • To properly handle fsync/msync in an efficient way DAX needs to track
    dirty pages so it is able to flush them durably to media on demand.

    The tracking of dirty pages is done via the radix tree in struct
    address_space. This radix tree is already used by the page writeback
    infrastructure for tracking dirty pages associated with an open file,
    and it already has support for exceptional (non struct page*) entries.
    We build upon these features to add exceptional entries to the radix
    tree for DAX dirty PMD or PTE pages at fault time.

    [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
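
    A toy, self-contained model of the mechanism (an array stands in for the
    per-file radix tree and all names are illustrative): the fault path records
    which page indices were mapped writably, and fsync then visits only those
    entries and flushes the corresponding memory, rather than walking the whole
    file.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>

        #define NR_PAGES 16

        struct dax_entry {
                void *kaddr;   /* where the page lives in persistent memory */
                bool  dirty;   /* set at write-fault time, cleared by fsync */
        };

        static struct dax_entry tree[NR_PAGES];  /* stand-in radix tree */

        static void dax_write_fault(size_t index, void *kaddr)
        {
                tree[index].kaddr = kaddr;
                tree[index].dirty = true;        /* remember: needs flushing */
        }

        static void flush_to_media(void *kaddr)
        {
                /* The kernel would flush CPU cache lines (clwb/clflush) and
                 * issue a store barrier here. */
                printf("flush %p\n", kaddr);
        }

        static void dax_fsync(void)
        {
                for (size_t i = 0; i < NR_PAGES; i++) {
                        if (!tree[i].dirty)
                                continue;        /* skip clean entries */
                        flush_to_media(tree[i].kaddr);
                        tree[i].dirty = false;
                }
        }

        int main(void)
        {
                static char pmem[NR_PAGES][4096];

                dax_write_fault(3, pmem[3]);
                dax_write_fault(7, pmem[7]);
                dax_fsync();   /* flushes only indices 3 and 7 */
                return 0;
        }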
     
  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

09 Sep, 2015

3 commits

  • This is the support code for DAX-enabled filesystems to allow them to
    provide huge pages in response to faults.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Add a vma_is_dax() helper macro to test whether the VMA is DAX, and use it
    in zap_huge_pmd() and __split_huge_page_pmd().

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
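
    The helper itself is tiny; roughly (a sketch of its likely definition,
    assuming the usual IS_DAX() inode-flag test):

        #include <linux/fs.h>
        #include <linux/mm.h>

        static inline bool vma_is_dax(struct vm_area_struct *vma)
        {
                /* A VMA is DAX when it maps a file whose inode is in DAX mode. */
                return vma->vm_file && IS_DAX(file_inode(vma->vm_file));
        }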
     
    In order to handle the !CONFIG_TRANSPARENT_HUGEPAGE case, we need to
    return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
    defined in linux/mm.h. Given that we don't want to include <linux/mm.h>
    in <linux/fs.h>, the easiest solution is to move the DAX-related
    functions to a new header, <linux/dax.h>. We could also have moved the
    VM_FAULT_* definitions to a new header, or to a different header that
    isn't quite such a boil-the-ocean header as <linux/mm.h>, but this felt
    like the best option.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
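
    The !CONFIG_TRANSPARENT_HUGEPAGE case then becomes a trivial inline stub in
    the new header; roughly (a sketch, using the argument list of that era):

        /* In <linux/dax.h> (sketch) */
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
                          pmd_t *pmd, unsigned int flags, get_block_t get_block,
                          dax_iodone_t complete_unwritten);
        #else
        static inline int dax_pmd_fault(struct vm_area_struct *vma,
                                        unsigned long addr, pmd_t *pmd,
                                        unsigned int flags, get_block_t get_block,
                                        dax_iodone_t complete_unwritten)
        {
                /* No THP support configured: fall back to PTE-sized faults. */
                return VM_FAULT_FALLBACK;
        }
        #endif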