28 May, 2016

2 commits

  • Pull vfs fixes from Al Viro:
    "Followups to the parallel lookup work:

    - update docs

    - restore killability of the places that used to take ->i_mutex
    killably now that we have down_write_killable() merged

    - Additionally, it turns out that I missed a prerequisite for
    security_d_instantiate() stuff - ->getxattr() wasn't the only thing
    that could be called before dentry is attached to inode; with smack
    we needed the same treatment applied to ->setxattr() as well"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->setxattr() to passing dentry and inode separately
    switch xattr_handler->set() to passing dentry and inode separately
    restore killability of old mutex_lock_killable(&inode->i_mutex) users
    add down_write_killable_nested()
    update D/f/directory-locking

    Linus Torvalds
     
  • preparation for similar switch in ->setxattr() (see the next commit for
    rationale).

    Signed-off-by: Al Viro

    Al Viro
     

27 May, 2016

1 commit

  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     

25 May, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Fix a number of bugs, most notably a potential stale data exposure
    after a crash and a potential BUG_ON crash if a file has the data
    journalling flag enabled while it has dirty delayed allocation blocks
    that haven't been written yet. Also fix a potential crash in the new
    project quota code and a maliciously corrupted file system.

    In addition, fix some DAX-specific bugs, including when there is a
    transient ENOSPC situation and races between writes via direct I/O and
    an mmap'ed segment that could lead to lost I/O.

    Finally the usual set of miscellaneous cleanups"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
    ext4: pre-zero allocated blocks for DAX IO
    ext4: refactor direct IO code
    ext4: fix race in transient ENOSPC detection
    ext4: handle transient ENOSPC properly for DAX
    dax: call get_blocks() with create == 1 for write faults to unwritten extents
    ext4: remove unmeetable inconsisteny check from ext4_find_extent()
    jbd2: remove excess descriptions for handle_s
    ext4: remove unnecessary bio get/put
    ext4: silence UBSAN in ext4_mb_init()
    ext4: address UBSAN warning in mb_find_order_for_block()
    ext4: fix oops on corrupted filesystem
    ext4: fix check of dqget() return value in ext4_ioctl_setproject()
    ext4: clean up error handling when orphan list is corrupted
    ext4: fix hang when processing corrupted orphaned inode list
    ext4: remove trailing \n from ext4_warning/ext4_error calls
    ext4: fix races between changing inode journal mode and ext4_writepages
    ext4: handle unwritten or delalloc buffers before enabling data journaling
    ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
    ext4: do not ask jbd2 to write data for delalloc buffers
    jbd2: add support for avoiding data writes during transaction commits
    ...

    Linus Torvalds
     

21 May, 2016

1 commit

  • Let's gather the UUID related functions under one hood.

    Signed-off-by: Andy Shevchenko
    Reviewed-by: Matt Fleming
    Cc: Dmitry Kasatkin
    Cc: Mimi Zohar
    Cc: Rasmus Villemoes
    Cc: Arnd Bergmann
    Cc: "Theodore Ts'o"
    Cc: Al Viro
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     

18 May, 2016

1 commit

  • Pull vfs cleanups from Al Viro:
    "More cleanups from Christoph"

    * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nfsd: use RWF_SYNC
    fs: add RWF_DSYNC aand RWF_SYNC
    ceph: use generic_write_sync
    fs: simplify the generic_write_sync prototype
    fs: add IOCB_SYNC and IOCB_DSYNC
    direct-io: remove the offset argument to dio_complete
    direct-io: eliminate the offset argument to ->direct_IO
    xfs: eliminate the pos variable in xfs_file_dio_aio_write
    filemap: remove the pos argument to generic_file_direct_write
    filemap: remove pos variables in generic_file_read_iter

    Linus Torvalds
     

17 May, 2016

3 commits

  • When a partition is not aligned to 4KB, mount -o dax succeeds,
    but any read/write access to the filesystem fails, except for
    metadata updates.

    Call bdev_dax_supported() to perform proper precondition checks
    which includes this partition alignment check.
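
    The alignment precondition can be sketched in userspace C as follows.
    This is a simplified model, not the kernel's bdev_dax_supported(): the
    4096-byte page size and the helper name dax_start_aligned() are
    assumptions made for illustration, and only the partition start offset
    is checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DAX_PAGE_SIZE 4096u /* assumed page size for this sketch */

/*
 * Hypothetical helper: returns true if a partition that starts at
 * start_sector (counted in sectors of sector_size bytes) begins on a
 * page boundary.
 */
bool dax_start_aligned(uint64_t start_sector, uint32_t sector_size)
{
    /* A zero sector size would make the byte offset meaningless. */
    assert(sector_size > 0);
    return (start_sector * sector_size) % DAX_PAGE_SIZE == 0;
}
```

    With 512-byte sectors, a partition starting at sector 2048 passes this
    check, while the legacy start at sector 63 does not.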

    Reported-by: Micah Parrish
    Signed-off-by: Toshi Kani
    Reviewed-by: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • Backmerge to resolve a conflict in ovl_lookup_real();
    "ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
    but it was too late in the cycle to rebase.

    Al Viro
     
  • Fault handlers currently take complete_unwritten argument to convert
    unwritten extents after PTEs are updated. However no filesystem uses
    this anymore as the code is racy. Remove the unused argument.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     

13 May, 2016

5 commits

  • Currently ext4 treats DAX IO the same way as direct IO. I.e., it
    allocates unwritten extents before IO is done and converts unwritten
    extents afterwards. However this way DAX IO can race with page fault to
    the same area:

    ext4_ext_direct_IO()                            dax_fault()
      dax_io()
        get_block() - allocates unwritten extent
        copy_from_iter_pmem()
                                                      get_block() - converts
                                                        unwritten block to
                                                        written and zeroes it
                                                        out
      ext4_convert_unwritten_extents()

    So data written with DAX IO gets lost. Similarly dax_new_buf() called
    from dax_io() can overwrite data that has been already written to the
    block via mmap.

    Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
    use them for DAX mmap. The downside of this solution is that every
    allocating write writes each block twice (once zeros, once data). Fixing
    the race with locking is possible as well however we would need to
    lock-out faults for the whole range written to by DAX IO. And that is
    not easy to do without locking-out faults for the whole file which seems
    too aggressive.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
    and ext4_ind_direct_IO(). However the extent based function calls into
    the indirect based one for some cases and for example it is not able to
    handle file extending. Previously it also did not properly handle
    retries in case of ENOSPC errors. With DAX things would get even more
    contrived, so just refactor the direct IO code and, instead of the
    indirect / extent split, split it into reads vs writes.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When there are blocks to free in the running transaction, block
    allocator can return ENOSPC although the filesystem has some blocks to
    free. We use ext4_should_retry_alloc() to force commit of the current
    transaction and return whether anything was committed so that it makes
    sense to retry the allocation. However the transaction may get committed
    after block allocation fails but before we call
    ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
    because there is nothing to commit and we wrongly return ENOSPC.

    Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
    when we tried to commit a transaction. This should not add any
    unnecessary retries since we had a transaction running a while ago when
    trying to allocate blocks and we want to retry the allocation once that
    transaction has committed anyway.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • ext4_dax_get_blocks() was accidentally omitted when fixing the get_blocks
    handlers to properly handle transient ENOSPC errors. Fix it now to use
    the ext4_get_blocks_trans() helper, which takes care of these errors.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Note that we need relax_dir() equivalent for directories
    locked shared.

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2016

4 commits

  • ext4_find_extent(), stripped down to the parts relevant to this patch,
    reads as

    ppos = 0;
    i = depth;
    while (i) {
        --i;
        ++ppos;
        if (unlikely(ppos > depth)) {
            ...
            ret = -EFSCORRUPTED;
            goto err;
        }
    }

    Due to the loop's bounds, the condition ppos > depth can never be met.

    Remove this dead code.
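
    The claim can be checked with a small userspace model of the loop.
    This is only a sketch: ppos_check_can_fire() is a hypothetical name,
    and the body of the real loop is reduced to the two counters.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the loop quoted above. Returns true if the
 * "ppos > depth" test would ever succeed for the given tree depth.
 */
bool ppos_check_can_fire(int depth)
{
    int ppos = 0;
    int i = depth;
    bool fired = false;

    assert(depth >= 0);
    while (i) {
        --i;
        ++ppos;
        /* ppos counts iterations and the loop runs exactly depth times. */
        if (ppos > depth)
            fired = true;
    }
    return fired;
}
```

    Since ppos is incremented once per iteration and the loop body runs
    depth times, ppos ends at depth and never exceeds it.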

    Signed-off-by: Nicolai Stange
    Signed-off-by: Theodore Ts'o

    Nicolai Stange
     
  • ext4_io_submit() used to check for EOPNOTSUPP after bio submission,
    which is why it had to get an extra reference to the bio before
    submitting it. But since we no longer touch the bio after submission,
    get rid of the redundant get/put of the bio. If we do get the extra
    reference, we enter the slower path of having to flag this bio as now
    having external references.

    Signed-off-by: Jens Axboe
    Signed-off-by: Theodore Ts'o

    Jens Axboe
     
  • Currently, in ext4_mb_init(), there's a loop like the following:

    do {
        ...
        offset += 1 << (sb->s_blocksize_bits - i);
        i++;
    } while (i <= sb->s_blocksize_bits + 1);

    Note that the updated offset is used in the loop's next iteration only.

    However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
    the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
    and UBSAN reports

    UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
    shift exponent 4294967295 is too large for 32-bit type 'int'
    [...]
    Call Trace:
    [] dump_stack+0xbc/0x117
    [] ? _atomic_dec_and_lock+0x169/0x169
    [] ubsan_epilogue+0xd/0x4e
    [] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
    [] ? __ubsan_handle_load_invalid_value+0x158/0x158
    [] ? kmem_cache_alloc+0x101/0x390
    [] ? ext4_mb_init+0x13b/0xfd0
    [] ? create_cache+0x57/0x1f0
    [] ? create_cache+0x11a/0x1f0
    [] ? mutex_lock+0x38/0x60
    [] ? mutex_unlock+0x1b/0x50
    [] ? put_online_mems+0x5b/0xc0
    [] ? kmem_cache_create+0x117/0x2c0
    [] ext4_mb_init+0xc49/0xfd0
    [...]

    Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

    Unless compilers start to do some fancy transformations (which at least
    GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
    value of offset calculated this way is never used again.

    Silence UBSAN by introducing another variable, offset_incr, holding the
    next increment to apply to offset and adjust that one by right shifting it
    by one position per loop iteration.
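
    A userspace sketch of the increment-variable approach is shown below.
    It is an illustration under assumptions, not the actual patch:
    mb_offset_for_order() is a hypothetical helper that returns the offset
    the loop would store for a given index, and the rest of the loop body
    is omitted.

```c
#include <assert.h>

/*
 * Hypothetical model of the offset computation. Walks the same
 * index range as the loop above (1 .. blocksize_bits + 1) and returns
 * the offset seen at index "order".
 */
unsigned int mb_offset_for_order(unsigned int blocksize_bits,
                                 unsigned int order)
{
    unsigned int i = 1;
    unsigned int offset = 0;
    unsigned int offset_incr;
    unsigned int result = 0;

    assert(blocksize_bits >= 1 && blocksize_bits < 32);
    assert(order >= 1 && order <= blocksize_bits + 1);

    offset_incr = 1u << (blocksize_bits - 1);
    do {
        if (i == order)
            result = offset;
        /* No shift by a computed exponent; the last pass adds 0. */
        offset += offset_incr;
        offset_incr >>= 1;
        i++;
    } while (i <= blocksize_bits + 1);

    return result;
}
```

    The increment is halved on each pass, so it reaches zero by the final
    iteration and no shift with an out-of-range exponent is ever evaluated.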

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

    Cc: stable@vger.kernel.org
    Signed-off-by: Nicolai Stange
    Signed-off-by: Theodore Ts'o

    Nicolai Stange
     
  • Currently, in mb_find_order_for_block(), there's a loop like the following:

    while (order <= e4b->bd_blkbits + 1) {
        ...
        bb += 1 << (e4b->bd_blkbits - order);
    }

    Note that the updated bb is used in the loop's next iteration only.

    However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
    the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

    UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
    shift exponent -1 is negative
    [...]
    Call Trace:
    [] dump_stack+0xbc/0x117
    [] ? _atomic_dec_and_lock+0x169/0x169
    [] ubsan_epilogue+0xd/0x4e
    [] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
    [] ? __ubsan_handle_load_invalid_value+0x158/0x158
    [] ? ext4_mb_generate_from_pa+0x590/0x590
    [] ? ext4_read_block_bitmap_nowait+0x598/0xe80
    [] mb_find_order_for_block+0x1ce/0x240
    [...]

    Unless compilers start to do some fancy transformations (which at least
    GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
    value of bb calculated this way is never used again.

    Silence UBSAN by introducing another variable, bb_incr, holding the next
    increment to apply to bb and adjust that one by right shifting it by one
    position per loop iteration.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

    Cc: stable@vger.kernel.org
    Signed-off-by: Nicolai Stange
    Signed-off-by: Theodore Ts'o

    Nicolai Stange
     

05 May, 2016

2 commits

  • When the filesystem is corrupted in the right way, it can happen that
    ext4_mark_iloc_dirty() in ext4_orphan_add() returns an error and we
    subsequently remove the inode from the in-memory orphan list. However this
    deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
    leave the i_orphan list_head with stale content. Later we can look at this
    content, causing list corruption, an oops, or other issues. The reported
    trace looked like:

    WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
    list_del corruption, 0000000061c1d6e0->next is LIST_POISON1 (0000000000100100)
    CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
    Stack:
    60462947 62219960 602ede24 62219960
    602ede24 603ca293 622198f0 602f02eb
    62219950 6002c12c 62219900 601b4d6b
    Call Trace:
    [] ? vprintk_emit+0x2dc/0x5c0
    [] ? printk+0x0/0x94
    [] show_stack+0xdc/0x1a0
    [] ? printk+0x0/0x94
    [] ? printk+0x0/0x94
    [] dump_stack+0x2a/0x2c
    [] warn_slowpath_common+0x9c/0xf0
    [] ? __list_del_entry+0x6b/0x100
    [] warn_slowpath_fmt+0x94/0xa0
    [] ? __mutex_lock_slowpath+0x239/0x3a0
    [] ? warn_slowpath_fmt+0x0/0xa0
    [] ? set_signals+0x3f/0x50
    [] ? kmem_cache_free+0x10a/0x180
    [] ? mutex_lock+0x18/0x30
    [] __list_del_entry+0x6b/0x100
    [] ext4_orphan_del+0x22c/0x2f0
    [] ? __ext4_journal_start_sb+0x2c/0xa0
    [] ? ext4_truncate+0x383/0x390
    [] ext4_write_begin+0x30b/0x4b0
    [] ? copy_from_user+0x0/0xb0
    [] ? iov_iter_fault_in_readable+0xa0/0xc0
    [] generic_perform_write+0xaf/0x1e0
    [] ? file_update_time+0x46/0x110
    [] __generic_file_write_iter+0x18f/0x1b0
    [] ext4_file_write_iter+0x15f/0x470
    [] ? unlink_file_vma+0x0/0x70
    [] ? unlink_anon_vmas+0x0/0x260
    [] ? free_pgtables+0xb9/0x100
    [] __vfs_write+0xb0/0x130
    [] vfs_write+0xa5/0x170
    [] SyS_write+0x56/0xe0
    [] ? __libc_waitpid+0x0/0xa0
    [] handle_syscall+0x68/0x90
    [] userspace+0x4fd/0x600
    [] ? save_registers+0x1f/0x40
    [] ? arch_prctl+0x177/0x1b0
    [] fork_handler+0x85/0x90

    Fix the problem by using list_del_init() as we always should with
    i_orphan list.
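
    The difference can be illustrated with a minimal userspace list. This
    is a sketch only: the node_*() helpers are hypothetical stand-ins that
    mirror the idea of the kernel's list_del_init(), not its code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_node {
    struct list_node *next, *prev;
};

/* An unlinked node (or empty head) points at itself. */
void node_init(struct list_node *n)
{
    n->next = n;
    n->prev = n;
}

/* Insert n right after head. */
void node_add(struct list_node *n, struct list_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink n and reinitialise it, so its pointers never go stale. */
void node_del_init(struct list_node *n)
{
    assert(n->next != NULL && n->prev != NULL);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}

bool node_unlinked(const struct list_node *n)
{
    return n->next == n;
}

/* Removing the same entry twice is harmless after a del-and-init. */
bool double_delete_is_safe(void)
{
    struct list_node head, entry;

    node_init(&head);
    node_init(&entry);
    node_add(&entry, &head);

    node_del_init(&entry); /* first removal, e.g. on an error path */
    node_del_init(&entry); /* second removal only touches entry itself */

    return node_unlinked(&head) && node_unlinked(&entry);
}
```

    A plain delete would leave the entry's pointers referring to its old
    neighbours (or to poison values), so a later removal would follow them.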

    CC: stable@vger.kernel.org
    Reported-by: Vegard Nossum
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • A failed call to dqget() returns an ERR_PTR() and not null. Fix
    the check in ext4_ioctl_setproject() to handle this correctly.
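
    Why a NULL test misses such failures can be shown with a userspace
    model of the error-pointer convention. This is a sketch under
    assumptions: err_ptr(), ptr_err(), is_err() and lookup_quota() are
    hypothetical stand-ins, and ESRCH is an arbitrary error code chosen
    for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
void *err_ptr(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO addresses. */
bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int dummy_quota; /* stands in for a real quota object */

/* Hypothetical lookup: returns an object, or an error pointer on failure. */
void *lookup_quota(bool fail)
{
    if (fail)
        return err_ptr(-ESRCH);
    return &dummy_quota;
}
```

    A failed lookup_quota(true) is not NULL, so a caller that tests only
    for NULL treats the error value as a valid object; is_err() catches it.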

    Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
    Cc: stable@vger.kernel.org # v4.5
    Signed-off-by: Seth Forshee
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Seth Forshee
     

03 May, 2016

1 commit


02 May, 2016

3 commits


30 Apr, 2016

2 commits

  • Instead of just printing warning messages, if the orphan list is
    corrupted, declare the file system corrupted. If there are any
    reserved inodes in the orphaned inode list, declare the file system
    corrupted and stop right away to avoid doing more potential damage to
    the file system.

    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • If the orphaned inode list contains inode #5, ext4_iget() returns a
    bad inode (since the bootloader inode should never be referenced
    directly). Because of the bad inode, we end up processing the inode
    repeatedly and this hangs the machine.

    This can be reproduced via:

    mke2fs -t ext4 /tmp/foo.img 100
    debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
    mount -o loop /tmp/foo.img /mnt

    (But don't do this if you are using an unpatched kernel if you care
    about the system staying functional. :-)

    This bug was found by the port of American Fuzzy Lop into the kernel
    to find file system problems[1]. (Since it *only* happens if inode #5
    shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
    surprising that AFL needed two hours before it found it.)

    [1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf

    Cc: stable@vger.kernel.org
    Reported-by: Vegard Nossum
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

27 Apr, 2016

1 commit


26 Apr, 2016

3 commits

  • In ext4, there is a race condition between changing the inode journal mode
    and ext4_writepages(). While ext4_writepages() is executing on a
    non-journalled mode inode, the inode's journal mode could be enabled
    by ioctl(), and then some pages dirtied after switching the journal
    mode will still be exposed to ext4_writepages() in non-journalled mode.
    To resolve this problem, we use a fs-wide per-cpu rw semaphore, at Jan
    Kara's suggestion, because we don't want to waste ext4_inode_info's
    space for this extremely rare case.

    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Daeho Jeong
     
  • We already allocate delalloc blocks before changing the inode mode into
    "per-file data journal" mode to prevent delalloc blocks from remaining
    unallocated, but another issue concerning the "BH_Unwritten" status
    still exists. For example, after fallocate(), several buffers' status
    changes to "BH_Unwritten", but these buffers cannot be processed by
    ext4_alloc_da_blocks(). So they remain in unwritten status after
    per-file data journaling is enabled and can no longer be changed to
    written status. If they are journaled and eventually checkpointed,
    these unwritten buffers will cause a kernel panic via the BUG_ON() below
    in submit_bh_wbc() when they are submitted during checkpointing.

    static int submit_bh_wbc(int rw, struct buffer_head *bh,...
    {
        ...
        BUG_ON(buffer_unwritten(bh));

    Moreover, when the "dioread_nolock" option is enabled, the status of a
    buffer is changed to "BH_Unwritten" after write_begin() completes and
    the "BH_Unwritten" status is cleared after I/O is done. Therefore,
    if a buffer's status is changed to unwritten but the buffer's I/O is
    not submitted and completed, it can cause the same problem after
    enabling per-file data journaling. You can easily reproduce this bug by
    executing the following command.

    ./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269

    To resolve these problems and define a boundary between the previous
    mode and per-file data journaling mode, we need to flush and wait for
    all the I/O on the buffers of a file before enabling per-file data
    journaling of the file.

    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Daeho Jeong
     
  • The function jbd2_journal_extend() takes as its argument the number of
    new credits to be added to the handle. We weren't taking into account
    the currently unused handle credits; worse, we would try to extend the
    handle by N credits when it had N credits available.
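
    The accounting described above can be sketched as a small helper.
    This is an illustration only, not the actual patch:
    extra_credits_to_request() is a hypothetical name, and the real code
    operates on a journal handle rather than on plain integers.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many additional credits to request so that
 * a handle holding "available" unused credits ends up with "needed".
 */
int extra_credits_to_request(int needed, int available)
{
    assert(needed >= 0 && available >= 0);

    /* Nothing to extend if the handle already has enough credits. */
    if (available >= needed)
        return 0;

    /* Ask only for the shortfall, not for the full amount again. */
    return needed - available;
}
```

    With N credits needed and N already available, the helper requests
    zero extra credits instead of asking for another N.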

    In the case where jbd2_journal_extend() fails because the transaction
    is too large, when jbd2_journal_restart() gets called, the N credits
    owned by the handle get returned to the transaction, and the
    transaction commit is asynchronously requested, and then
    start_this_handle() will be able to successfully attach the handle to
    the current transaction since the required credits are now available.

    This is mostly harmless, but since ext4_ext_truncate_extend_restart()
    returns EAGAIN, the truncate machinery will once again try to call
    ext4_ext_truncate_extend_restart(), which will do the above sequence
    over and over again until the transaction has committed.

    This was found while I was debugging a lockup caused by running
    xfstests generic/074 in the data=journal case. I'm still not sure why
    we ended up looping forever, which suggests there may still be another
    bug hiding in the transaction accounting machinery, but this commit
    prevents us from looping in the first place.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

24 Apr, 2016

5 commits

  • Currently we ask jbd2 to write all dirty allocated buffers before
    committing a transaction when doing writeback of delay allocated blocks.
    However this is unnecessary since we move all pages to writeback state
    before dropping a transaction handle and then submit all the necessary
    IO. We still need the transaction commit to wait for all the outstanding
    writeback before flushing disk caches during transaction commit to avoid
    data exposure issues though. Use the new jbd2 capability and ask it to
    only wait for outstanding writeback during transaction commit when
    writing back data in ext4_writepages().

    Tested-by: "HUANG Weller (CM/ESW12-CN)"
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently when filesystem needs to make sure data is on permanent
    storage before committing a transaction it adds inode to transaction's
    inode list. During transaction commit, jbd2 writes back all dirty
    buffers that have allocated underlying blocks and waits for the IO to
    finish. However when doing writeback for delayed allocated data, we
    allocate blocks and immediately submit the data. Thus asking jbd2 to
    write dirty pages just unnecessarily adds more work to jbd2 possibly
    writing back other redirtied blocks.

    Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
    outstanding data writes before committing a transaction and thus avoid
    unnecessary writes.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • This flag is just duplicating what ext4_should_order_data() tells you
    and is used in a single place. Furthermore it doesn't reflect changes to
    the inode data journalling flag, so it may be misleading. Just
    remove it.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Huang has reported that in his powerfail testing he is seeing stale
    block contents in some recently allocated blocks although he mounts
    ext4 in data=ordered mode. After some investigation I have found out
    that indeed when delayed allocation is used, we don't add inode to
    transaction's list of inodes needing flushing before commit. Originally
    we were doing that but commit f3b59291a69d removed the logic with a
    flawed argument that it is not needed.

    The problem is that although for delayed allocated blocks we write their
    contents immediately after allocating them, there is no guarantee that
    the IO scheduler or device doesn't reorder things and thus transaction
    allocating blocks and attaching them to inode can reach stable storage
    before actual block contents. Actually whenever we attach freshly
    allocated blocks to inode using a written extent, we should add inode to
    transaction's ordered inode list to make sure we properly wait for block
    contents to be written before committing the transaction. So that is
    what we do in this patch. This also handles other cases where stale data
    exposure was possible - like filling hole via mmap in
    data=ordered,nodelalloc mode.

    The only exception to the above rule is extending direct IO writes, where
    blkdev_direct_IO() waits for IO to complete before increasing i_size and
    thus stale data exposure is not possible. For now we don't complicate
    the code with optimizing this special case since the overhead is pretty
    low. In case this is observed to be a performance problem we can always
    handle it using a special flag to ext4_map_blocks().

    CC: stable@vger.kernel.org
    Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
    Reported-by: "HUANG Weller (CM/ESW12-CN)"
    Tested-by: "HUANG Weller (CM/ESW12-CN)"
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • If a directory has a large number of empty blocks, iterating over all
    of them can take a long time, leading to scheduler warnings and users
    getting irritated when they can't kill a process in the middle of one
    of these long-running readdir operations. Fix this by adding checks to
    ext4_readdir() and ext4_htree_fill_tree().

    This was reverted earlier due to a typo in the original commit where I
    experimented with using signal_pending() instead of
    fatal_signal_pending(). The test was in the wrong place if we were
    going to return signal_pending() since we would end up returning
    duplicate entries. See 9f2394c9be47 for a more detailed explanation.

    Added fix as suggested by Linus to check for signal_pending() in
    the filldir() functions.

    Reported-by: Benjamin LaHaise
    Google-Bug-Id: 27880676
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

13 Apr, 2016

1 commit


11 Apr, 2016

3 commits

  • ... and do not assume they are already attached to each other

    Signed-off-by: Al Viro

    Al Viro
     
  • This reverts commit 1028b55bafb7611dda1d8fed2aeca16a436b7dff.

    It's broken: it makes ext4 return an error at an invalid point, causing
    the readdir wrappers to write the position of the last successful
    directory entry into the position field, which means that the next
    readdir will now return that last successful entry _again_.

    You can only return fatal errors (that terminate the readdir directory
    walk) from within the filesystem readdir functions, the "normal" errors
    (that happen when the readdir buffer fills up, for example) happen in
    the iterator where we know the position of the actual failing entry.

    I do have a very different patch that does the "signal_pending()"
    handling inside the iterator function where it is allowable, but while
    that one passes all the sanity checks, I screwed up something like four
    times while emailing it out, so I'm not going to commit it today.

    So my track record is not good enough, and the stars will have to align
    better before that one gets committed. And it would be good to get some
    review too, of course, since celestial alignments are always an iffy
    debugging model.

    IOW, let's just revert the commit that caused the problem for now.

    Reported-by: Greg Thelen
    Cc: Theodore Ts'o
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • ... and neither can ever be NULL

    Signed-off-by: Al Viro

    Al Viro
     

08 Apr, 2016

1 commit

  • Pull ext4 bugfixes from Ted Ts'o:
    "These changes contain a fix for overlayfs interacting with some
    (badly behaved) dentry code in various file systems. These have been
    reviewed by Al and the respective file system maintainers and are going
    through the ext4 tree for convenience.

    This also has a few ext4 encryption bug fixes that were discovered in
    Android testing (yes, we will need to get these sync'ed up with the
    fs/crypto code; I'll take care of that). It also has some bug fixes
    and a change to ignore the legacy quota options to allow for xfstests
    regression testing of ext4's internal quota feature and to be more
    consistent with how xfs handles this case"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: ignore quota mount options if the quota feature is enabled
    ext4 crypto: fix some error handling
    ext4: avoid calling dquot_get_next_id() if quota is not enabled
    ext4: retry block allocation for failed DIO and DAX writes
    ext4: add lockdep annotations for i_data_sem
    ext4: allow readdir()'s of large empty directories to be interrupted
    btrfs: fix crash/invalid memory access on fsync when using overlayfs
    ext4 crypto: use dget_parent() in ext4_d_revalidate()
    ext4: use file_dentry()
    ext4: use dget_parent() in ext4_file_open()
    nfs: use file_dentry()
    fs: add file_dentry()
    ext4 crypto: don't let data integrity writebacks fail with ENOMEM
    ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()

    Linus Torvalds