09 Feb, 2017

1 commit

  • commit d1908f52557b3230fbd63c0429f3b4b748bf2b6d upstream.

    Tetsuo has noticed that an OOM stress test which performs large write
    requests can fully deplete the memory reserves. He has tracked this
    down to the following path:

    __alloc_pages_nodemask+0x436/0x4d0
    alloc_pages_current+0x97/0x1b0
    __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728
    pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331
    grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773
    iomap_write_begin+0x50/0xd0 fs/iomap.c:118
    iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190
    ? iomap_write_end+0x80/0x80 fs/iomap.c:150
    iomap_apply+0xb3/0x130 fs/iomap.c:79
    iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
    ? iomap_write_end+0x80/0x80
    xfs_file_buffered_aio_write+0x132/0x390 [xfs]
    ? remove_wait_queue+0x59/0x60
    xfs_file_write_iter+0x90/0x130 [xfs]
    __vfs_write+0xe5/0x140
    vfs_write+0xc7/0x1f0
    ? syscall_trace_enter+0x1d0/0x380
    SyS_write+0x58/0xc0
    do_syscall_64+0x6c/0x200
    entry_SYSCALL64_slow_path+0x25/0x25

    The OOM victim has access to all memory reserves to make forward
    progress toward exiting easier. But iomap_file_buffered_write and
    other callers of iomap_apply loop to complete the full request. We
    need to check for fatal signals and back off with a short write
    instead.

    As iomap_apply delegates all the work down to the actor, we have to
    hook into the actors. All actors that work with the page cache call
    iomap_write_begin, so we check for signals there, as sketched below.
    dax_iomap_actor has to handle the situation explicitly because it
    copies data to userspace directly. Other actors either work on a
    single page (iomap_page_mkwrite) or do not allocate memory based on
    the given len (iomap_fiemap_actor).
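
    A minimal sketch of the check, as it lands at the top of
    iomap_write_begin() (abbreviated; only the added bail-out is shown
    in context):

    static int
    iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
                    unsigned flags, struct page **pagep, struct iomap *iomap)
    {
            pgoff_t index = pos >> PAGE_SHIFT;
            struct page *page;

            BUG_ON(pos + len > iomap->offset + iomap->length);

            /* An OOM victim looping here would drain the memory reserves;
             * back off and let the caller return a short write instead. */
            if (fatal_signal_pending(current))
                    return -EINTR;

            page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
            /* ... remainder of the function is unchanged ... */
    }

    dax_iomap_actor gets an equivalent fatal_signal_pending() check at
    the top of its copy loop, breaking out with -EINTR.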

    Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
    Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Reviewed-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

08 Oct, 2016

1 commit

  • The global zero page is used to satisfy an anonymous read fault. If
    THP (Transparent HugePage) is enabled, the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to
    touch the global counter in two cases:

    1. The first time it uses the global huge zero page;
    2. When the mm_users count of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_users, kthreads are not eligible to use the
    huge zero page either. Since no kthread uses the huge zero page
    today, there is no difference after applying this patch. But if that
    is not desired, I can change it to trigger when mm_count reaches
    zero instead.
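
    A minimal sketch of the flag-based fast path (helper and flag names
    follow the commit message; the exact kernel code may differ):

    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
            /* Fast path: this mm already holds a reference. */
            if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    return READ_ONCE(huge_zero_page);

            /* Slow path: take one global reference for the whole mm. */
            if (!get_huge_zero_page())
                    return NULL;

            /* If another thread raced us here, keep only one reference. */
            if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    put_huge_zero_page();

            return READ_ONCE(huge_zero_page);
    }

    The matching global put then happens once per mm, when its mm_users
    count drops to zero.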

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance (throughput) of the workload for base commit: 1784430792
    Performance (throughput) of the workload for this commit: 4726928591
    A 164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    A 57% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

29 Jul, 2016

1 commit

  • Pull libnvdimm updates from Dan Williams:

    - Replace pcommit with ADR / directed-flushing.

    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement
    either ADR, or provide one or more flush addresses per nvdimm.

    ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
    to the memory controller on a power-fail event.

    Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
    Interface Table (NFIT) sub-structure: the "Flush Hint Address
    Structure". A flush hint is an mmio address that, when written and
    fenced, assures that all previous posted writes targeting a given
    dimm have been flushed to media (see the sketch after this list).

    - On-demand ARS (address range scrub).

    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices. When latent errors are detected, we re-scrub the
    media to refresh the bad block list; userspace can also request a
    re-scrub at any time.

    - Support for the Microsoft DSM (device specific method) command
    format.

    - Support for EDK2/OVMF virtual disk device memory ranges.

    - Various fixes and cleanups across the subsystem.
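
    A minimal sketch of how a flush hint is used, modeled on the
    directed-flushing idea above (the function and parameter names are
    illustrative assumptions, not the exact libnvdimm code):

    /* Flush a dimm's posted-write queues via its flush hint address. */
    static void flush_hint_write(void __iomem *flush_addr)
    {
            wmb();                  /* order prior stores before the hint write */
            writeq(1, flush_addr);  /* value is ignored; the mmio write triggers the flush */
            wmb();                  /* fence the hint write itself */
    }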

    * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
    libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
    nfit: do an ARS scrub on hitting a latent media error
    nfit: move to nfit/ sub-directory
    nfit, libnvdimm: allow an ARS scrub to be triggered on demand
    libnvdimm: register nvdimm_bus devices with an nd_bus driver
    pmem: clarify a debug print in pmem_clear_poison
    x86/insn: remove pcommit
    Revert "KVM: x86: add pcommit support"
    nfit, tools/testing/nvdimm/: unify shutdown paths
    libnvdimm: move ->module to struct nvdimm_bus_descriptor
    nfit: cleanup acpi_nfit_init calling convention
    nfit: fix _FIT evaluation memory leak + use after free
    tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
    tools/testing/nvdimm: add virtual ramdisk range
    acpi, nfit: treat virtual ramdisk SPA as pmem region
    pmem: kill __pmem address space
    pmem: kill wmb_pmem()
    libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
    fs/dax: remove wmb_pmem()
    libnvdimm, pmem: flush posted-write queues on shutdown
    ...

    Linus Torvalds
     

27 Jul, 2016

1 commit

  • Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this
    removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
    dax_pmd_fault() respectively, and update all callers.

    The dax_fault() and dax_pmd_fault() wrappers were initially intended
    to capture some filesystem-independent functionality around page
    faults (calling sb_start_pagefault() & sb_end_pagefault(), and
    updating the file mtime and ctime).
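
    For reference, the PTE wrapper being removed looked roughly like
    this (abbreviated from fs/dax.c):

    int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
                  get_block_t get_block)
    {
            int result;
            struct super_block *sb = file_inode(vma->vm_file)->i_sb;

            if (vmf->flags & FAULT_FLAG_WRITE) {
                    sb_start_pagefault(sb);
                    file_update_time(vma->vm_file);
            }
            result = __dax_fault(vma, vmf, get_block);
            if (vmf->flags & FAULT_FLAG_WRITE)
                    sb_end_pagefault(sb);

            return result;
    }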

    However, the following commits:

    5726b27b09cc ("ext2: Add locking for DAX faults")
    ea3d7209ca01 ("ext4: fix races between page faults and hole punching")

    added locking to the ext2 and ext4 filesystems after these common
    operations but before __dax_fault() and __dax_pmd_fault() were called.
    This means that these wrappers are no longer used, and are unlikely to
    be used in the future.

    XFS has had locking analogous to what was recently added to ext2 and
    ext4 since DAX support was initially introduced by:

    6b698edeeef0 ("xfs: add DAX file operations support")

    Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dan Williams
    Cc: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

13 Jul, 2016

2 commits

  • The __pmem address space was meant to annotate codepaths that touch
    persistent memory and need to coordinate a call to wmb_pmem(). Now that
    wmb_pmem() is gone, there is little need to keep this annotation.

    Cc: Christoph Hellwig
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Flushing posted-write queues is now deferred to REQ_FLUSH context, or
    otherwise handled by an ADR event at the platform level.

    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

28 Jun, 2016

1 commit

  • This isn't functionally apparent for some reason, but when we test
    I/O at extreme offsets at the end of the loff_t range, such as in
    fstests xfs/071, the calculation of "max" in dax_io() can be wrong
    due to pos + size overflowing.

    For example,

    # xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file

    enters dax_io with:

    start 0x7ffffffffffff000
    end 0x7ffffffffffff200

    and the rounded up "size" variable is 0x1000. This yields:

    pos + size 0x8000000000000000 (overflows loff_t)
    end 0x7ffffffffffff200

    Due to the overflow, the min() function picks the wrong value for
    the "max" variable, and when we send (max - pos) into e.g.
    copy_from_iter_pmem() it is also the wrong value.

    This somehow(tm) gets magically absorbed without incident, probably
    because iter->count is correct. But it seems best to fix it up
    properly by comparing the two values as unsigned.
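
    A sketch of the idea (not the verbatim patch): compute the bound in
    a type that cannot wrap.

            /* pos + size can exceed LLONG_MAX and wrap negative as a
             * loff_t, making min() pick the overflowed value. Compare
             * as unsigned instead. */
            loff_t max;

            if ((u64)pos + size > (u64)end)
                    max = end;
            else
                    max = pos + size;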

    Signed-off-by: Eric Sandeen
    Signed-off-by: Dan Williams

    Eric Sandeen
     

27 May, 2016

2 commits

  • Pull DAX locking updates from Ross Zwisler:
    "Filesystem DAX locking for 4.7

    - We use a bit in an exceptional radix tree entry as a lock bit and
    use it similarly to how page lock is used for normal faults. This
    fixes races between hole instantiation and read faults of the same
    index.

    - Filesystem DAX PMD faults are disabled, and will be re-enabled when
    PMD locking is implemented"

    * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Remove i_mmap_lock protection
    dax: Use radix tree entry lock to protect cow faults
    dax: New fault locking
    dax: Allow DAX code to replace exceptional entries
    dax: Define DAX lock bit for radix tree exceptional entry
    dax: Make huge page handling depend of CONFIG_BROKEN
    dax: Fix condition for filling of PMD holes

    Linus Torvalds
     
  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     

25 May, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Fix a number of bugs, most notably a potential stale data exposure
    after a crash and a potential BUG_ON crash if a file has the data
    journalling flag enabled while it has dirty delayed allocation blocks
    that haven't been written yet. Also fix a potential crash in the new
    project quota code and potential crashes when handling a maliciously
    corrupted file system.

    In addition, fix some DAX-specific bugs, including when there is a
    transient ENOSPC situation and races between writes via direct I/O and
    an mmap'ed segment that could lead to lost I/O.

    Finally the usual set of miscellaneous cleanups"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
    ext4: pre-zero allocated blocks for DAX IO
    ext4: refactor direct IO code
    ext4: fix race in transient ENOSPC detection
    ext4: handle transient ENOSPC properly for DAX
    dax: call get_blocks() with create == 1 for write faults to unwritten extents
    ext4: remove unmeetable inconsisteny check from ext4_find_extent()
    jbd2: remove excess descriptions for handle_s
    ext4: remove unnecessary bio get/put
    ext4: silence UBSAN in ext4_mb_init()
    ext4: address UBSAN warning in mb_find_order_for_block()
    ext4: fix oops on corrupted filesystem
    ext4: fix check of dqget() return value in ext4_ioctl_setproject()
    ext4: clean up error handling when orphan list is corrupted
    ext4: fix hang when processing corrupted orphaned inode list
    ext4: remove trailing \n from ext4_warning/ext4_error calls
    ext4: fix races between changing inode journal mode and ext4_writepages
    ext4: handle unwritten or delalloc buffers before enabling data journaling
    ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
    ext4: do not ask jbd2 to write data for delalloc buffers
    jbd2: add support for avoiding data writes during transaction commits
    ...

    Linus Torvalds
     

21 May, 2016

1 commit

  • These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do.
    Let's try to maintain the idea that radix-tree simply implements an
    abstract data type.

    Signed-off-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

20 May, 2016

6 commits

  • Currently faults are protected against truncate by the filesystem
    specific i_mmap_sem and, in the case of a hole page, by the page
    lock. Cow faults are protected by DAX radix tree entry locking. So
    there's no need for i_mmap_lock in the DAX code. Remove it.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • When doing cow faults, we cannot directly fill in the PTE as we do
    for other faults, as we rely on generic code to do proper accounting
    of the cowed page. Unlike other faults, we also have no page to lock
    to protect against races with truncate, and we need that protection
    to extend until the moment generic code inserts the cowed page into
    the PTE, at which point the fs-specific i_mmap_sem no longer covers
    us. So far we relied on i_mmap_lock for this protection, but that is
    completely special to cow faults. To make fault locking more
    uniform, use the DAX entry lock instead.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently DAX page fault locking is racy.

    CPU0 (write fault)                        CPU1 (read fault)

    __dax_fault()                             __dax_fault()
      get_block(inode, block, &bh, 0)
        -> not mapped
                                              get_block(inode, block, &bh, 0)
                                                -> not mapped
      if (!buffer_mapped(&bh))
        if (vmf->flags & FAULT_FLAG_WRITE)
          get_block(inode, block, &bh, 1)
            -> allocates blocks
      if (page) -> no
                                              if (!buffer_mapped(&bh))
                                                if (vmf->flags & FAULT_FLAG_WRITE) {
                                                } else {
                                                  dax_load_hole();
                                                }
      dax_insert_mapping()

    And we are in a situation where we fail in dax_radix_entry() with -EIO.

    Another problem with the current DAX page fault locking is that there
    is no race-free way to clear the dirty tag in the radix tree. We can
    always end up with a clean radix tree and dirty data in the CPU
    cache.

    We fix the first problem by introducing locking of exceptional radix
    tree entries in DAX mappings, acting very similarly to the page lock
    and thus properly synchronizing faults against the same mapping
    index. The same lock can later be used to avoid races when clearing
    the radix tree dirty tag.

    Reviewed-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • We will use the lowest available bit in the radix tree exceptional
    entry for locking of the entry. Define it. Also clean up the
    definitions of DAX entry type bits in DAX exceptional entries to use
    defined constants instead of hardcoded numbers, and clean up the
    checking of these bits so it does not rely on how other bits in the
    entry are set.
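
    A sketch of the resulting layout (modeled on the fs/dax.c
    definitions; treat the exact constants as assumptions):

    /* The radix tree reserves the low bits of an exceptional entry;
     * the lowest bit available to users becomes the lock bit, and the
     * DAX entry type bits sit above it instead of being hardcoded. */
    #define RADIX_DAX_ENTRY_LOCK  (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
    #define RADIX_DAX_PTE         (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
    #define RADIX_DAX_PMD         (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))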

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently the handling of huge pages for DAX is racy. For example the
    following can happen:

    CPU0 (THP write fault)                    CPU1 (normal read fault)

    __dax_pmd_fault()                         __dax_fault()
      get_block(inode, block, &bh, 0)
        -> not mapped
                                              get_block(inode, block, &bh, 0)
                                                -> not mapped
      if (!buffer_mapped(&bh) && write)
        get_block(inode, block, &bh, 1)
          -> allocates blocks
      truncate_pagecache_range(inode, lstart, lend);
                                              dax_load_hole();

    This results in data corruption since the process on CPU1 won't see
    the changes made to the file by CPU0.

    The race can happen even if two normal faults race, but with THP the
    situation is even worse because the two faults don't operate on the
    same entries in the radix tree and we want to use those entries for
    serialization. So make THP support in the DAX code depend on
    CONFIG_BROKEN for now.

    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently dax_pmd_fault() decides to fill a PMD-sized hole only if
    the returned buffer has BH_Uptodate set. However, that flag doesn't
    get set for any mapping buffer, so that branch is actually dead code.
    The BH_Uptodate check doesn't make any sense, so just remove it.

    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     

19 May, 2016

4 commits

  • The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in
    commit 09cbfea ("mm, fs: get rid of PAGE_CACHE_* and
    page_cache_{get,release} macros").

    The comments for the affected functions described a distinction
    between the two that is now redundant, so remove those paragraphs.

    Cc: Kirill A. Shutemov
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Vishal Verma

    Vishal Verma
     
  • In the truncate or hole-punch path in dax, we clear out sub-page
    ranges. If these sub-page ranges are sector-aligned and sector-sized,
    we can do the zeroing through the driver instead, so that
    error-clearing is handled automatically.

    For sub-sector ranges, we still have to rely on clear_pmem and have the
    possibility of tripping over errors.
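
    The decision boils down to something like this, following the
    __dax_zero_page_range() helper introduced in this series (a sketch;
    the exact helper shape is an assumption):

            if (IS_ALIGNED(offset, 512) && IS_ALIGNED(length, 512)) {
                    /* Whole sectors: zero via the block layer so the
                     * driver can also clear known-bad sectors. */
                    return blkdev_issue_zeroout(bdev, sector + (offset >> 9),
                                                length >> 9, GFP_NOFS, true);
            }
            /* Sub-sector remainder: fall back to clear_pmem() and accept
             * that a latent media error here can still trip us up. */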

    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Vishal Verma

    Vishal Verma
     
  • This allows XFS to perform zeroing using the iomap infrastructure and
    avoid buffer heads.

    Reviewed-by: Jan Kara
    Signed-off-by: Christoph Hellwig
    [vishal: fix conflicts with dax-error-handling]
    Signed-off-by: Vishal Verma

    Christoph Hellwig
     
  • dax_clear_sectors() cannot handle poisoned blocks. These must be
    zeroed using the BIO interface instead. Convert ext2 and XFS to use
    only sb_issue_zeroout().
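
    The conversion itself is a one-liner at each call site, roughly (a
    sketch; the arguments at each site are assumptions):

            /* Zero through the block layer instead of dax_clear_sectors()
             * so that poisoned blocks get cleared rather than faulting. */
            err = sb_issue_zeroout(inode->i_sb, block, nr_blocks, GFP_NOFS);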

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Matthew Wilcox
    [vishal: Also remove the dax_clear_sectors function entirely]
    Signed-off-by: Vishal Verma

    Matthew Wilcox
     

17 May, 2016

7 commits

  • In preparation for consulting a badblocks list in
    pmem_direct_access(), teach dax_pmd_fault() to fall back rather than
    fail immediately upon encountering an error. The thought is that
    reducing the span of the dax request may avoid the error region.
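
    The shape of the change in dax_pmd_fault() is roughly (a sketch;
    the surrounding error-path details are assumptions):

            length = dax_map_atomic(bdev, &dax);
            if (length < 0) {
                    /* Instead of returning VM_FAULT_SIGBUS, fall back so
                     * the fault is retried with PTEs, which may steer
                     * around the bad region. */
                    count_vm_event(THP_FAULT_FALLBACK);
                    return VM_FAULT_FALLBACK;
            }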

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams
    Signed-off-by: Vishal Verma

    Dan Williams
     
  • Callers of dax fault handlers must make sure these calls cannot race
    with truncate. Thus it is enough to check the inode size when
    entering the function, and we don't have to recheck it again later in
    the handler. Note that the inode size itself can be decreased while
    the fault handler runs, but filesystem locking protects against any
    radix tree or block mapping information changes resulting from the
    truncate, and that is what we really care about.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • dax_do_io() calls filemap_write_and_wait() if the DIO_LOCKING flag
    is set. Presumably this was copied over from the direct IO code.
    However, DAX inodes have no pagecache pages to write, so the call is
    pointless. Remove it.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • All the filesystems now zero out blocks themselves for DAX IO to
    avoid races between dax_io() and dax_fault(). Remove the zeroing code
    from dax_io() and add a warning to catch the case when somebody
    unexpectedly returns a new or unwritten buffer.
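
    The added warning amounts to something like this in the dax_io()
    loop (a sketch; exact placement abbreviated):

            /* Filesystems must pre-zero blocks for DAX IO, so a new or
             * unwritten buffer here indicates a filesystem bug. */
            WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));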

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • Now that all filesystems zero out blocks allocated for a fault
    handler, we can just remove the zeroing from the handler itself. Also
    add checks that no filesystem returns an unwritten or new buffer to
    us.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • Fault handlers currently take a complete_unwritten argument to
    convert unwritten extents after PTEs are updated. However, no
    filesystem uses this anymore, as the code is racy. Remove the unused
    argument.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do.
    Let's try to maintain the idea that radix-tree simply implements an
    abstract data type.

    Acked-by: Ross Zwisler
    Reviewed-by: Matthew Wilcox
    Signed-off-by: NeilBrown
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    NeilBrown
     

13 May, 2016

1 commit

  • Currently, __dax_fault() does not call get_blocks() with the create
    argument set when we get back an unwritten extent from the initial
    get_blocks() call during a write fault. This is because originally
    filesystems were supposed to convert unwritten extents to written
    ones using the complete_unwritten() callback. Later this was
    abandoned in favor of using pre-zeroed blocks, but the condition
    deciding whether get_blocks() needs to be called with create == 1
    remained.

    Fix the condition so that filesystems are not forced to zero out and
    convert unwritten extents when get_blocks() is called with
    create == 0 (which introduces unnecessary overhead for read faults
    and can be problematic, as the filesystem may possibly be read-only).
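
    The shape of the fixed logic in __dax_fault() is roughly (a sketch
    of the intent, not the verbatim patch):

            /* For write faults, call get_block() with create == 1 even if
             * the extent is merely unwritten, so the filesystem converts
             * it; read faults keep create == 0 and simply see zeros. */
            if ((vmf->flags & FAULT_FLAG_WRITE) &&
                (!buffer_mapped(&bh) || buffer_unwritten(&bh)))
                    error = get_block(inode, block, &bh, 1);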

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle
    using the script below. For some reason, coccinelle doesn't patch
    header files; I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to
    the PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation
    will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Mar, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things
    all over the place, which add up to a lot of changes in the end.

    The major changes are that we've reduced the size of the struct
    xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
    >10%), the writepage code now only does a single set of mapping tree
    lookups so uses less CPU, delayed allocation reservations won't
    overrun under random write loads anymore, and we added compile time
    verification for on-disk structure sizes so we find out when a commit
    or platform/compiler change breaks the on disk structure as early as
    possible.

    Change summary:

    - error propagation for direct IO failures fixes for both XFS and
    ext4
    - new quota interfaces and XFS implementation for iterating all the
    quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
    roughly 100 bytes of memory per cached inode.
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
    mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
    xfs: always set rvalp in xfs_dir2_node_trim_free
    xfs: ensure committed is initialized in xfs_trans_roll
    xfs: borrow indirect blocks from freed extent when available
    xfs: refactor delalloc indlen reservation split into helper
    xfs: update freeblocks counter after extent deletion
    xfs: debug mode forced buffered write failure
    xfs: remove impossible condition
    xfs: check sizes of XFS on-disk structures at compile time
    xfs: ioends require logically contiguous file offsets
    xfs: use named array initializers for log item dumping
    xfs: fix computation of inode btree maxlevels
    xfs: reinitialise per-AG structures if geometry changes during recovery
    xfs: remove xfs_trans_get_block_res
    xfs: fix up inode32/64 (re)mount handling
    xfs: fix format specifier , should be %llx and not %llu
    xfs: sanitize remount options
    xfs: convert mount option parsing to tokens
    xfs: fix two memory leaks in xfs_attr_list.c error paths
    xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
    xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
    ...

    Linus Torvalds
     

10 Mar, 2016

1 commit

  • dax_pfn_mkwrite() previously wasn't checking the return value of the
    call to dax_radix_entry(), which was a mistake.

    Instead, capture this return value and return the appropriate VM_FAULT_
    value.
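
    The shape of the fix (the dax_radix_entry() argument list here is
    approximate and should be treated as an assumption):

            error = dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR,
                                    false, true);
            if (error == -ENOMEM)
                    return VM_FAULT_OOM;
            if (error)
                    return VM_FAULT_SIGBUS;
            return VM_FAULT_NOPAGE;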

    Signed-off-by: Ross Zwisler
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

28 Feb, 2016

1 commit

  • Previously calls to dax_writeback_mapping_range() for all DAX filesystems
    (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

    dax_writeback_mapping_range() needs a struct block_device, and it used
    to get that from inode->i_sb->s_bdev. This is correct for normal inodes
    mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
    block devices and for XFS real-time files.

    Instead, call dax_writeback_mapping_range() directly from the filesystem
    ->writepages function so that it can supply us with a valid block
    device. This also fixes DAX code to properly flush caches in response
    to sync(2).
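
    The XFS side, for example, ends up looking roughly like this
    (abbreviated from xfs_vm_writepages()):

    STATIC int
    xfs_vm_writepages(
            struct address_space    *mapping,
            struct writeback_control *wbc)
    {
            xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
            if (dax_mapping(mapping))
                    return dax_writeback_mapping_range(mapping,
                                    xfs_find_bdev_for_inode(mapping->host), wbc);

            return generic_writepages(mapping, wbc);
    }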

    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler