06 Jan, 2017

1 commit

  • commit 7e6e1ef48fc02f3ac5d0edecbb0c6087cd758d58 upstream.

    Don't load an inode with a negative size; this causes integer overflow
    problems in the VFS.

    [ Added EXT4_ERROR_INODE() to mark file system as corrupted. -TYT]

    Fixes: a48380f769df (ext4: rename i_dir_acl to i_size_high)
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
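As a rough illustration of the check described above, the following user-space sketch combines two 32-bit halves of a file size into a signed 64-bit value and rejects a negative result. The struct and function names are hypothetical and chosen for illustration only; this is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk fields holding the two halves of the file size. */
struct raw_inode_size {
    uint32_t size_lo;
    uint32_t size_high;
};

/*
 * Combine the halves into a signed 64-bit size.
 * Returns 0 and stores the size on success, or -1 if the combined value
 * does not fit in a non-negative int64_t (top bit of size_high set),
 * which the caller should treat as corruption.
 */
int load_inode_size(const struct raw_inode_size *raw, int64_t *size)
{
    assert(raw != NULL && size != NULL);

    uint64_t combined = ((uint64_t)raw->size_high << 32) | raw->size_lo;

    if (combined > (uint64_t)INT64_MAX)
        return -1;
    *size = (int64_t)combined;
    return 0;
}
```

Rejecting the value at load time means later arithmetic on the size never starts from a negative number.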
     

11 Oct, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

08 Oct, 2016

1 commit


30 Sep, 2016

2 commits

  • When zeroing blocks for DAX allocations, we also have to unmap aliases
    in the block device mappings. Otherwise writeback can overwrite zeros
    with stale data from block device page cache.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Jan Kara
     
  • We can easily support parallel direct IO reads. We only have to make
    sure we cannot expose uninitialized data by reading an allocated block to
    which data was not written yet, or which was already truncated. That is
    easily achieved by holding inode_lock in shared mode - that excludes all
    writes, truncates, hole punches. We also have to guard against page
    writeback allocating blocks for delay-allocated pages - that race is
    handled by the fact that we writeback all the pages in the affected
    range and the lock protects us from new pages being created there.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

22 Sep, 2016

2 commits

  • Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
    This is because the logic around filemap_write_and_wait_range() in
    ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
    not for dirty DAX exceptional entries.

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc:
    Signed-off-by: Theodore Ts'o

    Ross Zwisler
     
  • inode_change_ok() will be responsible for clearing capabilities and IMA
    extended attributes and as such will need a dentry. Give it as an argument
    to inode_change_ok() instead of an inode. Also rename inode_change_ok()
    to setattr_prepare() to better reflect that it also does some
    modifications in addition to checks.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

15 Sep, 2016

1 commit

  • When ext4 delayed block allocation fails, the buffers of the affected
    pages are cleared, but the pte_dirty flag of those pages is not.
    If such a page is later unmapped, unmap_page_range() may, according to
    pte_dirty, try to call __set_page_dirty(),

    which may lead to the BUG_ON at
    mpage_prepare_extent_to_map:head = page_buffers(page);.

    This patch calls clear_page_dirty_for_io() in
    mpage_release_unused_pages() to clear pte_dirty for mmapped pages.

    Steps to reproduce the bug:

    (1) mmap a file in ext4
    addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
    fd, 0);
    memset(addr, 'i', 4096);

    (2) return EIO at

    ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent

    which causes this log message to be printed:

    ext4_msg(sb, KERN_CRIT,
             "Delayed block allocation failed for "
             "inode %lu at logical offset %llu with"
             " max blocks %u with error %d",
             inode->i_ino,
             (unsigned long long)map->m_lblk,
             (unsigned)map->m_len, -err);

    (3) Unmapping addr causes a warning at

    __set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));

    (4) Wait for a minute, then the BUG_ON triggers.

    Cc: stable@vger.kernel.org
    Signed-off-by: wangguang
    Signed-off-by: Theodore Ts'o

    wangguang
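Step (1) of the reproduction above can be sketched as a small user-space helper. This is only an illustration: the function name is hypothetical, and it covers mapping and dirtying the page (plus the later unmap), not the injected EIO from step (2).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map the first 4096 bytes of `path` shared and writable, dirty them
 * through the mapping, then unmap. Returns 0 on success, -1 on error.
 * Note: the file length is set to exactly 4096 bytes.
 */
int dirty_first_page(const char *path)
{
    const size_t len = 4096;

    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* The file must cover the mapped range, or the store would SIGBUS. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(addr, 'i', len);  /* dirties the page via the mapping */
    munmap(addr, len);       /* corresponds to step (3) */
    close(fd);
    return 0;
}
```

On a healthy filesystem the data simply reaches the file; the bug only shows up when writeback of the dirtied page fails as in step (2).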
     

06 Sep, 2016

2 commits

  • Use the ext4_{has,set,clear}_feature_* helpers to replace the old
    feature helpers.

    Signed-off-by: Kaho Ng
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Reviewed-by: Darrick J. Wong

    Kaho Ng
     
  • Currently, ext4_do_update_inode() clears the high 16-bit fields of
    uid/gid of deleted and evicted inodes to fix up interoperability with
    old kernels. However, it checks only i_dtime of an inode to determine
    whether the inode was deleted and evicted, and this is very risky,
    because i_dtime can also be used for the pointer maintaining the orphan
    inode list. We need to further check whether i_dtime is being used for
    the orphan inode list even if i_dtime is not NULL.

    We found that the high 16-bit fields of an inode's uid/gid are
    unintentionally and permanently cleared when the following happen in
    order: inode truncation is triggered but not finished, the inode
    metadata with the cleared high uid/gid bits is written to disk, and a
    sudden power-off follows.

    Cc: stable@vger.kernel.org
    Signed-off-by: Daeho Jeong
    Signed-off-by: Hobin Woo
    Signed-off-by: Theodore Ts'o

    Daeho Jeong
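The extra condition described above can be sketched in user space as follows. The struct, its fields, and the function name are hypothetical stand-ins for illustration; the sketch only shows the decision, not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory inode state for this sketch. */
struct uidgid_inode {
    uint32_t uid;
    uint32_t dtime;        /* nonzero: deletion time OR orphan-list link */
    bool on_orphan_list;   /* true while dtime is reused as a list link */
};

/*
 * Decide which high 16 bits of the uid to write to disk.
 * Only a nonzero dtime on an inode that is NOT on the orphan list means
 * "deleted and evicted", so only then are the high bits cleared.
 */
uint16_t uid_high_to_write(const struct uidgid_inode *inode)
{
    assert(inode != NULL);

    if (inode->dtime != 0 && !inode->on_orphan_list)
        return 0;
    return (uint16_t)(inode->uid >> 16);
}
```

Checking dtime alone would return 0 for the second case below, which is the loss of uid bits the commit describes.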
     

30 Aug, 2016

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "Fix bugs that could cause kernel deadlocks or file system corruption
    while moving xattrs to expand the extended inode.

    Also add some sanity checks to the block group descriptors to make
    sure we don't end up overwriting the superblock"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: avoid deadlock when expanding inode size
    ext4: properly align shifted xattrs when expanding inodes
    ext4: fix xattr shifting when expanding inodes part 2
    ext4: fix xattr shifting when expanding inodes
    ext4: validate that metadata blocks do not overlap superblock
    ext4: reserve xattr index for the Hurd

    Linus Torvalds
     

12 Aug, 2016

1 commit

  • When we need to move xattrs into external xattr block, we call
    ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
    up calling ext4_mark_inode_dirty() again which will recurse back into
    the inode expansion code leading to deadlocks.

    Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
    its management into ext4_expand_extra_isize_ea() since its manipulation
    is safe there (due to xattr_sem) from possible races with
    ext4_xattr_set_handle() which plays with it as well.

    CC: stable@vger.kernel.org # 4.4.x
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
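The recursion guard described above follows a common pattern: set a flag before the step that may re-enter, and skip the expansion while the flag is set. A minimal user-space sketch, with hypothetical names and a boolean standing in for EXT4_STATE_NO_EXPAND (locking is omitted entirely):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inode state for this sketch. */
struct expand_inode {
    bool no_expand;    /* stands in for EXT4_STATE_NO_EXPAND */
    int expand_calls;  /* how often the expansion path actually ran */
    int dirty_calls;   /* how often the inode was marked dirty */
};

void sketch_mark_dirty(struct expand_inode *inode);

/* Expansion path: may dirty the inode again while moving xattrs. */
void sketch_expand(struct expand_inode *inode)
{
    assert(!inode->no_expand);   /* must never be re-entered */

    inode->no_expand = true;     /* guard against recursion */
    inode->expand_calls++;
    sketch_mark_dirty(inode);    /* models the nested dirtying */
    inode->no_expand = false;
}

/* Marking the inode dirty triggers expansion unless the guard is set. */
void sketch_mark_dirty(struct expand_inode *inode)
{
    inode->dirty_calls++;
    if (!inode->no_expand)
        sketch_expand(inode);
}
```

With the guard, one outer call results in exactly one expansion and one nested dirtying; without it, the two functions would call each other without end.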
     

27 Jul, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The major change this cycle is deleting ext4's copy of the file system
    encryption code and switching things over to using the copies in
    fs/crypto. I've updated the MAINTAINERS file to add an entry for
    fs/crypto listing Jaegeuk Kim and myself as the maintainers.

    There are also a number of bug fixes, most notably for some problems
    found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum. Also
    fixed is a writeback deadlock detected by generic/130, and some
    potential races in the metadata checksum code"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
    ext4: verify extent header depth
    ext4: short-cut orphan cleanup on error
    ext4: fix reference counting bug on block allocation error
    MAINTAINRES: fs-crypto maintainers update
    ext4 crypto: migrate into vfs's crypto engine
    ext2: fix filesystem deadlock while reading corrupted xattr block
    ext4: fix project quota accounting without quota limits enabled
    ext4: validate s_reserved_gdt_blocks on mount
    ext4: remove unused page_idx
    ext4: don't call ext4_should_journal_data() on the journal inode
    ext4: Fix WARN_ON_ONCE in ext4_commit_super()
    ext4: fix deadlock during page writeback
    ext4: correct error value of function verifying dx checksum
    ext4: avoid modifying checksum fields directly during checksum verification
    ext4: check for extents that wrap around
    jbd2: make journal y2038 safe
    jbd2: track more dependencies on transaction commit
    jbd2: move lockdep tracking to journal_s
    jbd2: move lockdep instrumentation for jbd2 handles
    ext4: respect the nobarrier mount option in nojournal mode
    ...

    Linus Torvalds
     

11 Jul, 2016

1 commit


04 Jul, 2016

3 commits

  • If ext4_fill_super() fails early, it's possible for ext4_evict_inode()
    to call ext4_should_journal_data() before superblock options and flags
    are fully set up. In that case, the iput() on the journal inode can
    end up causing a BUG().

    Work around this problem by reordering the tests so we only call
    ext4_should_journal_data() after we know it's not the journal inode.

    Fixes: 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data")
    Fixes: 2b405bfa84 ("ext4: fix data=journal fast mount/umount hang")
    Cc: Jan Kara
    Cc: stable@vger.kernel.org
    Signed-off-by: Vegard Nossum
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Vegard Nossum
     
  • Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
    deadlock in ext4_writepages() which was previously much harder to hit.
    After this commit xfstest generic/130 reproduces the deadlock on small
    filesystems.

    The problem happens when ext4_do_update_inode() sets LARGE_FILE feature
    and marks current inode handle as synchronous. That subsequently results
    in ext4_journal_stop() called from ext4_writepages() to block waiting for
    transaction commit while still holding page locks, reference to io_end,
    and some prepared bio in mpd structure each of which can possibly block
    transaction commit from completing and thus results in deadlock.

    Fix the problem by releasing page locks, io_end reference, and
    submitting prepared bio before calling ext4_journal_stop().

    [ Changed to defer the call to ext4_journal_stop() only if the handle
    is synchronous. --tytso ]

    Reported-and-tested-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     
  • We temporarily change checksum fields in buffers of some types of
    metadata to '0' for verifying the checksum values. By doing this
    without locking the buffer, some metadata's checksums, which are
    being committed or written back to the storage, could be damaged.
    In our test, several metadata blocks were found with damaged metadata
    checksum value during recovery process. When we only verify the
    checksum value, we have to avoid modifying checksum fields directly.

    Signed-off-by: Daeho Jeong
    Signed-off-by: Youngjin Gil
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Darrick J. Wong

    Daeho Jeong
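One way to verify a checksum without writing to the buffer is to feed the checksum function the bytes before the field, then zero bytes from a local dummy, then the bytes after the field. The sketch below assumes a 4-byte field and uses a plain bitwise CRC-32 purely for illustration; the function names are hypothetical and the checksum algorithm is a stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 update (reflected polynomial 0xEDB88320). */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * Checksum `buf` as if the 4-byte field at `csum_off` were zero,
 * without ever writing to `buf`.
 */
uint32_t csum_skip_field(const uint8_t *buf, size_t len, size_t csum_off)
{
    static const uint8_t zero[4] = {0};
    uint32_t crc = 0xFFFFFFFFu;

    assert(buf != NULL && csum_off + sizeof(zero) <= len);

    crc = crc32_update(crc, buf, csum_off);
    crc = crc32_update(crc, zero, sizeof(zero));
    crc = crc32_update(crc, buf + csum_off + sizeof(zero),
                       len - csum_off - sizeof(zero));
    return ~crc;
}
```

Because the buffer is only read, a concurrent writeback of the same block never sees a transiently zeroed checksum field.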
     

08 Jun, 2016

2 commits


25 May, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Fix a number of bugs, most notably a potential stale data exposure
    after a crash and a potential BUG_ON crash if a file has the data
    journalling flag enabled while it has dirty delayed allocation blocks
    that haven't been written yet. Also fix a potential crash in the new
    project quota code and a maliciously corrupted file system.

    In addition, fix some DAX-specific bugs, including when there is a
    transient ENOSPC situation and races between writes via direct I/O and
    an mmap'ed segment that could lead to lost I/O.

    Finally the usual set of miscellaneous cleanups"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
    ext4: pre-zero allocated blocks for DAX IO
    ext4: refactor direct IO code
    ext4: fix race in transient ENOSPC detection
    ext4: handle transient ENOSPC properly for DAX
    dax: call get_blocks() with create == 1 for write faults to unwritten extents
    ext4: remove unmeetable inconsisteny check from ext4_find_extent()
    jbd2: remove excess descriptions for handle_s
    ext4: remove unnecessary bio get/put
    ext4: silence UBSAN in ext4_mb_init()
    ext4: address UBSAN warning in mb_find_order_for_block()
    ext4: fix oops on corrupted filesystem
    ext4: fix check of dqget() return value in ext4_ioctl_setproject()
    ext4: clean up error handling when orphan list is corrupted
    ext4: fix hang when processing corrupted orphaned inode list
    ext4: remove trailing \n from ext4_warning/ext4_error calls
    ext4: fix races between changing inode journal mode and ext4_writepages
    ext4: handle unwritten or delalloc buffers before enabling data journaling
    ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
    ext4: do not ask jbd2 to write data for delalloc buffers
    jbd2: add support for avoiding data writes during transaction commits
    ...

    Linus Torvalds
     

13 May, 2016

3 commits

  • Currently ext4 treats DAX IO the same way as direct IO. I.e., it
    allocates unwritten extents before IO is done and converts unwritten
    extents afterwards. However this way DAX IO can race with page fault to
    the same area:

    ext4_ext_direct_IO()                          dax_fault()
      dax_io()
        get_block() - allocates unwritten extent
        copy_from_iter_pmem()
                                                    get_block() - converts
                                                      unwritten block to
                                                      written and zeroes it
                                                      out
      ext4_convert_unwritten_extents()

    So data written with DAX IO gets lost. Similarly dax_new_buf() called
    from dax_io() can overwrite data that has been already written to the
    block via mmap.

    Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
    use them for DAX mmap. The downside of this solution is that every
    allocating write writes each block twice (once zeros, once data). Fixing
    the race with locking is possible as well however we would need to
    lock-out faults for the whole range written to by DAX IO. And that is
    not easy to do without locking-out faults for the whole file which seems
    too aggressive.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
    and ext4_ind_direct_IO(). However the extent based function calls into
    the indirect based one for some cases and, for example, it is not able to
    handle file extending. Previously it also did not properly handle
    retries in case of ENOSPC errors. With DAX things would get even more
    convoluted, so just refactor the direct IO code and, instead of the
    indirect / extent split, split it into reads vs writes.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • ext4_dax_get_blocks() was accidentally omitted when fixing the get blocks
    handlers to properly handle transient ENOSPC errors. Fix it now to use
    the ext4_get_blocks_trans() helper, which takes care of these errors.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

02 May, 2016

1 commit


26 Apr, 2016

2 commits

  • In ext4, there is a race condition between changing inode journal mode
    and ext4_writepages(). While ext4_writepages() is executed on a
    non-journalled mode inode, the inode's journal mode could be enabled
    by ioctl() and then, some pages dirtied after switching the journal
    mode will be still exposed to ext4_writepages() in non-journaled mode.
    To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
    Kara's suggestion because we don't want to waste ext4_inode_info's
    space for this extra rare case.

    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Daeho Jeong
     
  • We already allocate delalloc blocks before changing the inode mode into
    "per-file data journal" mode to prevent delalloc blocks from remaining
    not allocated, but another issue concerned with "BH_Unwritten" status
    still exists. For example, by fallocate(), several buffers' status
    change into "BH_Unwritten", but these buffers cannot be processed by
    ext4_alloc_da_blocks(). So, they still remain in unwritten status after
    per-file data journaling is enabled and they cannot be changed into
    written status any more and, if they are journaled and eventually
    checkpointed, these unwritten buffers will cause a kernel panic via the
    BUG_ON() below in submit_bh_wbc() when they are submitted
    during checkpointing.

    static int submit_bh_wbc(int rw, struct buffer_head *bh,...
    {
            ...
            BUG_ON(buffer_unwritten(bh));

    Moreover, when "dioread_nolock" option is enabled, the status of a
    buffer is changed into "BH_Unwritten" after write_begin() completes and
    the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
    if a buffer's status is changed into unwritten but the buffer's I/O is
    not submitted and completed, it can cause the same problem after
    enabling per-file data journaling. You can easily generate this bug by
    executing the following command.

    ./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269

    To resolve these problems and define a boundary between the previous
    mode and per-file data journaling mode, we need to flush and wait for all
    the I/O of buffers of a file before enabling per-file data journaling
    of the file.

    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Daeho Jeong
     

24 Apr, 2016

3 commits

  • Currently we ask jbd2 to write all dirty allocated buffers before
    committing a transaction when doing writeback of delay allocated blocks.
    However this is unnecessary since we move all pages to writeback state
    before dropping a transaction handle and then submit all the necessary
    IO. We still need the transaction commit to wait for all the outstanding
    writeback before flushing disk caches during transaction commit to avoid
    data exposure issues though. Use the new jbd2 capability and ask it to
    only wait for outstanding writeback during transaction commit when
    writing back data in ext4_writepages().

    Tested-by: "HUANG Weller (CM/ESW12-CN)"
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • This flag is just duplicating what ext4_should_order_data() tells you
    and is used in a single place. Furthermore it doesn't reflect changes to
    the inode data journalling flag, so it may be misleading. Just
    remove it.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Huang has reported that in his powerfail testing he is seeing stale
    block contents in some of recently allocated blocks although he mounts
    ext4 in data=ordered mode. After some investigation I have found out
    that indeed when delayed allocation is used, we don't add inode to
    transaction's list of inodes needing flushing before commit. Originally
    we were doing that but commit f3b59291a69d removed the logic with a
    flawed argument that it is not needed.

    The problem is that although for delayed allocated blocks we write their
    contents immediately after allocating them, there is no guarantee that
    the IO scheduler or device doesn't reorder things and thus transaction
    allocating blocks and attaching them to inode can reach stable storage
    before actual block contents. Actually whenever we attach freshly
    allocated blocks to inode using a written extent, we should add inode to
    transaction's ordered inode list to make sure we properly wait for block
    contents to be written before committing the transaction. So that is
    what we do in this patch. This also handles other cases where stale data
    exposure was possible - like filling hole via mmap in
    data=ordered,nodelalloc mode.

    The only exception to the above rule are extending direct IO writes where
    blkdev_direct_IO() waits for IO to complete before increasing i_size and
    thus stale data exposure is not possible. For now we don't complicate
    the code with optimizing this special case since the overhead is pretty
    low. In case this is observed to be a performance problem we can always
    handle it using a special flag to ext4_map_blocks().

    CC: stable@vger.kernel.org
    Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
    Reported-by: "HUANG Weller (CM/ESW12-CN)"
    Tested-by: "HUANG Weller (CM/ESW12-CN)"
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

08 Apr, 2016

1 commit

  • Pull ext4 bugfixes from Ted Ts'o:
    "These changes contain a fix for overlayfs interacting with some
    (badly behaved) dentry code in various file systems. These have been
    reviewed by Al and the respective file system maintainers and are going
    through the ext4 tree for convenience.

    This also has a few ext4 encryption bug fixes that were discovered in
    Android testing (yes, we will need to get these sync'ed up with the
    fs/crypto code; I'll take care of that). It also has some bug fixes
    and a change to ignore the legacy quota options to allow for xfstests
    regression testing of ext4's internal quota feature and to be more
    consistent with how xfs handles this case"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: ignore quota mount options if the quota feature is enabled
    ext4 crypto: fix some error handling
    ext4: avoid calling dquot_get_next_id() if quota is not enabled
    ext4: retry block allocation for failed DIO and DAX writes
    ext4: add lockdep annotations for i_data_sem
    ext4: allow readdir()'s of large empty directories to be interrupted
    btrfs: fix crash/invalid memory access on fsync when using overlayfs
    ext4 crypto: use dget_parent() in ext4_d_revalidate()
    ext4: use file_dentry()
    ext4: use dget_parent() in ext4_file_open()
    nfs: use file_dentry()
    fs: add file_dentry()
    ext4 crypto: don't let data integrity writebacks fail with ENOMEM
    ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()

    Linus Torvalds
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

01 Apr, 2016

1 commit

  • Currently if block allocation for DIO or DAX write fails due to ENOSPC,
    we just return it to userspace. However these ENOSPC errors can be
    transient because the transaction freeing blocks has not yet committed.
    This manifests as failures of the generic/102 test when the filesystem is
    mounted with 'dax' mount option.

    Fix the problem by properly retrying the allocation in case of ENOSPC
    error in get blocks functions used for direct IO.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Tested-by: Ross Zwisler

    Jan Kara
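The retry idea described above can be sketched as a bounded loop around an allocation callback. Everything here is hypothetical and user-space only: `alloc_with_retry()`, the callback type, and the demo allocator are illustration names, and a real implementation would presumably wait for the pending transaction to commit between attempts rather than retry immediately.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical allocation callback: returns 0 or a negative errno. */
typedef int (*alloc_fn)(void *ctx);

/*
 * Call `alloc`, and retry while it reports -ENOSPC, allowing at most
 * `max_retries` additional attempts after the first one.
 */
int alloc_with_retry(alloc_fn alloc, void *ctx, int max_retries)
{
    int ret;
    int retries = 0;

    assert(alloc != NULL);

    do {
        ret = alloc(ctx);
    } while (ret == -ENOSPC && retries++ < max_retries);

    return ret;
}

/* Demo allocator: fails with -ENOSPC until the counter in ctx hits zero. */
int transient_enospc(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return -ENOSPC;
    }
    return 0;
}
```

Bounding the number of attempts keeps a genuinely full filesystem from looping forever; in that case the caller still sees -ENOSPC.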
     

22 Mar, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things
    all over the place, which add up to a lot of changes in the end.

    The major changes are that we've reduced the size of the struct
    xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
    >10%), the writepage code now only does a single set of mapping tree
    lookups so uses less CPU, delayed allocation reservations won't
    overrun under random write loads anymore, and we added compile time
    verification for on-disk structure sizes so we find out when a commit
    or platform/compiler change breaks the on disk structure as early as
    possible.

    Change summary:

    - error propagation for direct IO failures fixes for both XFS and
    ext4
    - new quota interfaces and XFS implementation for iterating all the
    quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
    roughly 100 bytes of memory per cached inode.
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
    mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
    xfs: always set rvalp in xfs_dir2_node_trim_free
    xfs: ensure committed is initialized in xfs_trans_roll
    xfs: borrow indirect blocks from freed extent when available
    xfs: refactor delalloc indlen reservation split into helper
    xfs: update freeblocks counter after extent deletion
    xfs: debug mode forced buffered write failure
    xfs: remove impossible condition
    xfs: check sizes of XFS on-disk structures at compile time
    xfs: ioends require logically contiguous file offsets
    xfs: use named array initializers for log item dumping
    xfs: fix computation of inode btree maxlevels
    xfs: reinitialise per-AG structures if geometry changes during recovery
    xfs: remove xfs_trans_get_block_res
    xfs: fix up inode32/64 (re)mount handling
    xfs: fix format specifier , should be %llx and not %llu
    xfs: sanitize remount options
    xfs: convert mount option parsing to tokens
    xfs: fix two memory leaks in xfs_attr_list.c error paths
    xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
    xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Performance improvements in SEEK_DATA and xattr scalability
    improvements, plus a lot of clean ups and bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits)
    ext4: clean up error handling in the MMP support
    jbd2: do not fail journal because of frozen_buffer allocation failure
    ext4: use __GFP_NOFAIL in ext4_free_blocks()
    ext4: fix compile error while opening the macro DOUBLE_CHECK
    ext4: print ext4 mount option data_err=abort correctly
    ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
    ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()
    ext4: fix misspellings in comments.
    jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
    ext4: more efficient SEEK_DATA implementation
    ext4: cleanup handling of bh->b_state in DAX mmap
    ext4: return hole from ext4_map_blocks()
    ext4: factor out determining of hole size
    ext4: fix setting of referenced bit in ext4_es_lookup_extent()
    ext4: remove i_ioend_count
    ext4: simplify io_end handling for AIO DIO
    ext4: move trans handling and completion deferal out of _ext4_get_block
    ext4: rename and split get blocks functions
    ext4: use i_mutex to serialize unaligned AIO DIO
    ext4: pack ioend structure better
    ...

    Linus Torvalds
     

13 Mar, 2016

1 commit

  • ext4_reserve_inode_write() in ext4_mark_inode_dirty() could fail on
    error (e.g. EIO) and iloc.bh can be NULL in this case. But the error is
    ignored in the following "if" condition and ext4_expand_extra_isize()
    might be called with NULL iloc.bh set, which triggers NULL pointer
    dereference.

    This was uncovered by commit 8b4953e13f4c ("ext4: reserve code points for
    the project quota feature"), which enlarges the ext4_inode size. It can
    be reproduced by running the following script on a new kernel but with
    an old mke2fs:

    #!/bin/bash
    mnt=/mnt/ext4
    devname=ext4-error
    dev=/dev/mapper/$devname
    fsimg=/home/fs.img

    trap cleanup 0 1 2 3 9 15

    cleanup()
    {
    umount $mnt >/dev/null 2>&1
    dmsetup remove $devname
    losetup -d $backend_dev
    rm -f $fsimg
    exit 0
    }

    rm -f $fsimg
    fallocate -l 1g $fsimg
    backend_dev=`losetup -f --show $fsimg`
    devsize=`blockdev --getsz $backend_dev`

    good_tab="0 $devsize linear $backend_dev 0"
    error_tab="0 $devsize error $backend_dev 0"

    dmsetup create $devname --table "$good_tab"

    mkfs -t ext4 $dev
    mount -t ext4 -o errors=continue,strictatime $dev $mnt

    dmsetup load $devname --table "$error_tab" && dmsetup resume $devname
    echo 3 > /proc/sys/vm/drop_caches
    ls -l $mnt
    exit 0

    [ Patch changed to simplify the function a tiny bit. -- Ted ]

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o

    Eryu Guan
     

10 Mar, 2016

3 commits

  • Using SEEK_DATA in a huge sparse file can easily lead to softlockups as
    ext4_seek_data() iterates over a hole block-by-block. Fix the problem by
    using the hole size returned from ext4_map_blocks() and thus skipping the
    hole in one go.

    Also update the SEEK_HOLE implementation to follow the same pattern as
    SEEK_DATA to make future maintenance easier.

    Furthermore we add cond_resched() to both ext4_seek_data() and
    ext4_seek_hole() to avoid softlockups in case an evil user creates a huge
    fragmented file and we have to go through lots of extents.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
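The difference between stepping through a hole block-by-block and skipping it in one go can be sketched over a sorted extent array. The types and functions below are hypothetical user-space stand-ins: `map_block()` plays the role of a lookup that returns 0 for a hole but still reports how long the hole is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: `len` mapped blocks starting at block `start`. */
struct extent_sketch {
    uint64_t start;
    uint64_t len;
};

/*
 * Lookup over sorted, non-overlapping extents.
 * Returns 1 if `blk` is mapped. Otherwise returns 0 and stores the number
 * of hole blocks from `blk` up to the next extent (or up to `end`).
 */
int map_block(const struct extent_sketch *ext, size_t n,
              uint64_t blk, uint64_t end, uint64_t *hole_len)
{
    assert((ext != NULL || n == 0) && hole_len != NULL);

    for (size_t i = 0; i < n; i++) {
        if (blk >= ext[i].start && blk < ext[i].start + ext[i].len)
            return 1;
        if (blk < ext[i].start) {
            *hole_len = ext[i].start - blk;
            return 0;
        }
    }
    *hole_len = end > blk ? end - blk : 0;
    return 0;
}

/* Return the first mapped block at or after `blk`, or `end` if none. */
uint64_t seek_data(const struct extent_sketch *ext, size_t n,
                   uint64_t blk, uint64_t end)
{
    while (blk < end) {
        uint64_t hole_len = 0;

        if (map_block(ext, n, blk, end, &hole_len))
            return blk;
        blk += hole_len;  /* skip the whole hole in one step */
    }
    return end;
}
```

With this shape the loop runs once per hole rather than once per block, so a sparse file with a million-block hole costs one lookup instead of a million.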
     
  • ext4_dax_mmap_get_block() updates bh->b_state directly instead of using
    ext4_update_bh_state(). This is mostly a cosmetic issue since DAX code
    always passes on-stack buffer_head but clean this up to make code more
    uniform.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently, ext4_map_blocks() just returns 0 when it finds a hole and
    allocation is not requested. However we have all the information
    available to tell how large the hole actually is and there are callers
    of ext4_map_blocks() which would save some block-by-block hole iteration
    if they knew this information. So fill in struct ext4_map_blocks even
    for holes with the information we have. We keep returning 0 for holes to
    maintain backward compatibility of the function.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

09 Mar, 2016

1 commit