20 May, 2016

6 commits

  • When doing cow faults, we cannot directly fill in PTE as we do for other
    faults as we rely on generic code to do proper accounting of the cowed page.
    We also have no page to lock to protect against races with truncate as
    other faults have and we need the protection to extend until the moment
    generic code inserts cowed page into PTE thus at that point we have no
    protection of fs-specific i_mmap_sem. So far we relied on using
    i_mmap_lock for the protection however that is completely special to cow
    faults. To make fault locking more uniform use DAX entry lock instead.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently DAX page fault locking is racy.

    CPU0 (write fault) CPU1 (read fault)

    __dax_fault() __dax_fault()
    get_block(inode, block, &bh, 0) -> not mapped
    get_block(inode, block, &bh, 0)
    -> not mapped
    if (!buffer_mapped(&bh))
    if (vmf->flags & FAULT_FLAG_WRITE)
    get_block(inode, block, &bh, 1) -> allocates blocks
    if (page) -> no
    if (!buffer_mapped(&bh))
    if (vmf->flags & FAULT_FLAG_WRITE) {
    } else {
    dax_load_hole();
    }
    dax_insert_mapping()

    And we are in a situation where we fail in dax_radix_entry() with -EIO.

    Another problem with the current DAX page fault locking is that there is
    no race-free way to clear dirty tag in the radix tree. We can always
    end up with clean radix tree and dirty data in CPU cache.

    We fix the first problem by introducing locking of exceptional radix
    tree entries in DAX mappings acting very similarly to page lock and thus
    synchronizing properly faults against the same mapping index. The same
    lock can later be used to avoid races when clearing radix tree dirty
    tag.

    Reviewed-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently we forbid page_cache_tree_insert() to replace exceptional radix
    tree entries for DAX inodes. However to make DAX faults race free we will
    lock radix tree entries and when hole is created, we need to replace
    such locked radix tree entry with a hole page. So modify
    page_cache_tree_insert() to allow that.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • We will use lowest available bit in the radix tree exceptional entry for
    locking of the entry. Define it. Also clean up definitions of DAX entry
    type bits in DAX exceptional entries to use defined constants instead of
    hardcoding numbers and cleanup checking of these bits to not rely on how
    other bits in the entry are set.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently the handling of huge pages for DAX is racy. For example the
    following can happen:

    CPU0 (THP write fault) CPU1 (normal read fault)

    __dax_pmd_fault() __dax_fault()
    get_block(inode, block, &bh, 0) -> not mapped
    get_block(inode, block, &bh, 0)
    -> not mapped
    if (!buffer_mapped(&bh) && write)
    get_block(inode, block, &bh, 1) -> allocates blocks
    truncate_pagecache_range(inode, lstart, lend);
    dax_load_hole();

    This results in data corruption since process on CPU1 won't see changes
    into the file done by CPU0.

    The race can happen even if two normal faults race however with THP the
    situation is even worse because the two faults don't operate on the same
    entries in the radix tree and we want to use these entries for
    serialization. So make THP support in DAX code depend on CONFIG_BROKEN
    for now.

    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently dax_pmd_fault() decides to fill a PMD-sized hole only if
    returned buffer has BH_Uptodate set. However that doesn't get set for
    any mapping buffer so that branch is actually a dead code. The
    BH_Uptodate check doesn't make any sense so just remove it.

    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     

19 May, 2016

5 commits

  • The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in

    09cbfea mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release}
    macros

    The comments for the above functions described a distinction between
    those, that is now redundant, so remove those paragraphs

    Cc: Kirill A. Shutemov
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Vishal Verma

    Vishal Verma
     
  • In the truncate or hole-punch path in dax, we clear out sub-page ranges.
    If these sub-page ranges are sector aligned and sized, we can do the
    zeroing through the driver instead so that error-clearing is handled
    automatically.

    For sub-sector ranges, we still have to rely on clear_pmem and have the
    possibility of tripping over errors.

    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Vishal Verma

    Vishal Verma
     
  • This allows XFS to perform zeroing using the iomap infrastructure and
    avoid buffer heads.

    Reviewed-by: Jan Kara
    Signed-off-by: Christoph Hellwig
    [vishal: fix conflicts with dax-error-handling]
    Signed-off-by: Vishal Verma

    Christoph Hellwig
     
  • dax_clear_sectors() cannot handle poisoned blocks. These must be
    zeroed using the BIO interface instead. Convert ext2 and XFS to use
    only sb_issue_zerout().

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Matthew Wilcox
    [vishal: Also remove the dax_clear_sectors function entirely]
    Signed-off-by: Vishal Verma

    Matthew Wilcox
     
  • 1/ If a mapping overlaps a bad sector fail the request.

    2/ Do not opportunistically report more dax-capable capacity than is
    requested when errors present.

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    [vishal: fix a conflict with system RAM collision patches]
    [vishal: add a 'size' parameter to ->direct_access]
    [vishal: fix a conflict with DAX alignment check patches]
    Signed-off-by: Vishal Verma

    Dan Williams
     

17 May, 2016

15 commits

  • In preparation for consulting a badblocks list in pmem_direct_access(),
    teach dax_pmd_fault() to fallback rather than fail immediately upon
    encountering an error. The thought being that reducing the span of the
    dax request may avoid the error region.

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams
    Signed-off-by: Vishal Verma

    Dan Williams
     
  • blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
    to remain as a separate interface for checking dax capability of
    a raw block device.

    Rename and relocate blkdev_dax_capable() to keep them maintained
    consistently, and call bdev_direct_access() for the dax capability
    check.

    There is no change in the behavior.

    Link: https://lkml.org/lkml/2016/5/9/950
    Signed-off-by: Toshi Kani
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • When a partition is not aligned by 4KB, mount -o dax succeeds,
    but any read/write access to the filesystem fails, except for
    metadata update.

    Call bdev_dax_supported() to perform proper precondition checks
    which includes this partition alignment check.

    Signed-off-by: Toshi Kani
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • When a partition is not aligned by 4KB, mount -o dax succeeds,
    but any read/write access to the filesystem fails, except for
    metadata update.

    Call bdev_dax_supported() to perform proper precondition checks
    which includes this partition alignment check.

    Signed-off-by: Toshi Kani
    Reviewed-by: Jan Kara
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • When a partition is not aligned by 4KB, mount -o dax succeeds,
    but any read/write access to the filesystem fails, except for
    metadata update.

    Call bdev_dax_supported() to perform proper precondition checks
    which includes this partition alignment check.

    Reported-by: Micah Parrish
    Signed-off-by: Toshi Kani
    Reviewed-by: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • DAX imposes additional requirements to a device. Add
    bdev_dax_supported() which performs all the precondition checks
    necessary for filesystem to mount the device with dax option.

    Also add a new check to verify if a partition is aligned by 4KB.
    When a partition is unaligned, any dax read/write access fails,
    except for metadata update.

    Signed-off-by: Toshi Kani
    Reviewed-by: Christoph Hellwig
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • In preparation of moving DAX capability checks to the block layer
    from filesystem code, add a VFS message interface that aligns with
    filesystem's message format.

    For instance, a vfs_msg() message followed by XFS messages in case
    of a dax mount error may look like:

    VFS (pmem0p1): error: unaligned partition for dax
    XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
    XFS (pmem0p1): Mounting V5 Filesystem
    :

    vfs_msg() is largely based on ext4_msg().

    Signed-off-by: Toshi Kani
    Reviewed-by: Christoph Hellwig
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • Callers of dax fault handlers must make sure these calls cannot race
    with truncate. Thus it is enough to check inode size when entering the
    function and we don't have to recheck it again later in the handler.
    Note that inode size itself can be decreased while the fault handler
    runs but filesystem locking prevents against any radix tree or block
    mapping information changes resulting from the truncate and that is what
    we really care about.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • dax_do_io() is calling filemap_write_and_wait() if DIO_LOCKING flags is
    set. Presumably this was copied over from direct IO code. However DAX
    inodes have no pagecache pages to write so the call is pointless. Remove
    it.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • All the filesystems are now zeroing blocks themselves for DAX IO to avoid
    races between dax_io() and dax_fault(). Remove the zeroing code from
    dax_io() and add warning to catch the case when somebody unexpectedly
    returns new or unwritten buffer.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • Now that all filesystems zero out blocks allocated for a fault handler,
    we can just remove the zeroing from the handler itself. Also add checks
    that no filesystem returns to us unwritten or new buffer.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • Currently ext2 zeroes any data blocks allocated for DAX inode however it
    still returns them as BH_New. Thus DAX code zeroes them again in
    dax_insert_mapping() which can possibly overwrite the data that has been
    already stored to those blocks by a racing dax_io(). Avoid marking
    pre-zeroed buffers as new.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • When zeroing allocated blocks for DAX, we accidentally zeroed only the
    first allocated block instead of all of them. So far this problem is
    hidden by the fact that page faults always need only a single block and
    DAX write code zeroes blocks again. But the zeroing in DAX code is racy
    and needs to be removed so fix the zeroing in ext2 to zero all allocated
    blocks.

    Reported-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • Fault handlers currently take complete_unwritten argument to convert
    unwritten extents after PTEs are updated. However no filesystem uses
    this anymore as the code is racy. Remove the unused argument.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
    Let's try to maintain the idea that radix-tree simply implements an
    abstract data type.

    Acked-by: Ross Zwisler
    Reviewed-by: Matthew Wilcox
    Signed-off-by: NeilBrown
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    NeilBrown
     

13 May, 2016

5 commits

  • Currently ext4 treats DAX IO the same way as direct IO. I.e., it
    allocates unwritten extents before IO is done and converts unwritten
    extents afterwards. However this way DAX IO can race with page fault to
    the same area:

    ext4_ext_direct_IO() dax_fault()
    dax_io()
    get_block() - allocates unwritten extent
    copy_from_iter_pmem()
    get_block() - converts
    unwritten block to
    written and zeroes it
    out
    ext4_convert_unwritten_extents()

    So data written with DAX IO gets lost. Similarly dax_new_buf() called
    from dax_io() can overwrite data that has been already written to the
    block via mmap.

    Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
    use them for DAX mmap. The downside of this solution is that every
    allocating write writes each block twice (once zeros, once data). Fixing
    the race with locking is possible as well however we would need to
    lock-out faults for the whole range written to by DAX IO. And that is
    not easy to do without locking-out faults for the whole file which seems
    too aggressive.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
    and ext4_ind_direct_IO(). However the extent based function calls into
    the indirect based one for some cases and for example it is not able to
    handle file extending. Previously it was not also properly handling
    retries in case of ENOSPC errors. With DAX things would get even more
    contrieved so just refactor the direct IO code and instead of indirect /
    extent split do the split to read vs writes.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When there are blocks to free in the running transaction, block
    allocator can return ENOSPC although the filesystem has some blocks to
    free. We use ext4_should_retry_alloc() to force commit of the current
    transaction and return whether anything was committed so that it makes
    sense to retry the allocation. However the transaction may get committed
    after block allocation fails but before we call
    ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
    because there is nothing to commit and we wrongly return ENOSPC.

    Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
    when we tried to commit a transaction. This should not add any
    unnecessary retries since we had a transaction running a while ago when
    trying to allocate blocks and we want to retry the allocation once that
    transaction has committed anyway.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • ext4_dax_get_blocks() was accidentally omitted fixing get blocks
    handlers to properly handle transient ENOSPC errors. Fix it now to use
    ext4_get_blocks_trans() helper which takes care of these errors.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently, __dax_fault() does not call get_blocks() callback with create
    argument set, when we got back unwritten extent from the initial
    get_blocks() call during a write fault. This is because originally
    filesystems were supposed to convert unwritten extents to written ones
    using complete_unwritten() callback. Later this was abandoned in favor of
    using pre-zeroed blocks however the condition whether get_blocks() needs
    to be called with create == 1 remained.

    Fix the condition so that filesystems are not forced to zero-out and
    convert unwritten extents when get_blocks() is called with create == 0
    (which introduces unnecessary overhead for read faults and can be
    problematic as the filesystem may possibly be read-only).

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

06 May, 2016

5 commits

  • ext4_find_extent(), stripped down to the parts relevant to this patch,
    reads as

    ppos = 0;
    i = depth;
    while (i) {
    --i;
    ++ppos;
    if (unlikely(ppos > depth)) {
    ...
    ret = -EFSCORRUPTED;
    goto err;
    }
    }

    Due to the loop's bounds, the condition ppos > depth can never be met.

    Remove this dead code.

    Signed-off-by: Nicolai Stange
    Signed-off-by: Theodore Ts'o

    Nicolai Stange
     
  • Commit bf6993276f74 ("jbd2: Use tracepoints for history file")
    removed the members j_history, j_history_max and j_history_cur from struct
    handle_s but the descriptions stayed lingering. Removing them.

    Signed-off-by: Luis de Bethencourt
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Luis de Bethencourt
     
  • ext4_io_submit() used to check for EOPNOTSUPP after bio submission,
    which is why it had to get an extra reference to the bio before
    submitting it. But since we no longer touch the bio after submission,
    get rid of the redundant get/put of the bio. If we do get the extra
    reference, we enter the slower path of having to flag this bio as now
    having external references.

    Signed-off-by: Jens Axboe
    Signed-off-by: Theodore Ts'o

    Jens Axboe
     
  • Currently, in ext4_mb_init(), there's a loop like the following:

    do {
    ...
    offset += 1 << (sb->s_blocksize_bits - i);
    i++;
    } while (i s_blocksize_bits + 1);

    Note that the updated offset is used in the loop's next iteration only.

    However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
    the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
    and UBSAN reports

    UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
    shift exponent 4294967295 is too large for 32-bit type 'int'
    [...]
    Call Trace:
    [] dump_stack+0xbc/0x117
    [] ? _atomic_dec_and_lock+0x169/0x169
    [] ubsan_epilogue+0xd/0x4e
    [] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
    [] ? __ubsan_handle_load_invalid_value+0x158/0x158
    [] ? kmem_cache_alloc+0x101/0x390
    [] ? ext4_mb_init+0x13b/0xfd0
    [] ? create_cache+0x57/0x1f0
    [] ? create_cache+0x11a/0x1f0
    [] ? mutex_lock+0x38/0x60
    [] ? mutex_unlock+0x1b/0x50
    [] ? put_online_mems+0x5b/0xc0
    [] ? kmem_cache_create+0x117/0x2c0
    [] ext4_mb_init+0xc49/0xfd0
    [...]

    Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

    Unless compilers start to do some fancy transformations (which at least
    GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
    such calculated value of offset is never used again.

    Silence UBSAN by introducing another variable, offset_incr, holding the
    next increment to apply to offset and adjust that one by right shifting it
    by one position per loop iteration.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

    Cc: stable@vger.kernel.org
    Signed-off-by: Nicolai Stange
    Signed-off-by: Theodore Ts'o

    Nicolai Stange
     
  • Currently, in mb_find_order_for_block(), there's a loop like the following:

    while (order bd_blkbits + 1) {
    ...
    bb += 1 << (e4b->bd_blkbits - order);
    }

    Note that the updated bb is used in the loop's next iteration only.

    However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
    the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

    UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
    shift exponent -1 is negative
    [...]
    Call Trace:
    [] dump_stack+0xbc/0x117
    [] ? _atomic_dec_and_lock+0x169/0x169
    [] ubsan_epilogue+0xd/0x4e
    [] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
    [] ? __ubsan_handle_load_invalid_value+0x158/0x158
    [] ? ext4_mb_generate_from_pa+0x590/0x590
    [] ? ext4_read_block_bitmap_nowait+0x598/0xe80
    [] mb_find_order_for_block+0x1ce/0x240
    [...]

    Unless compilers start to do some fancy transformations (which at least
    GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
    such calculated value of bb is never used again.

    Silence UBSAN by introducing another variable, bb_incr, holding the next
    increment to apply to bb and adjust that one by right shifting it by one
    position per loop iteration.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

    Cc: stable@vger.kernel.org
    Signed-off-by: Nicolai Stange
    Signed-off-by: Theodore Ts'o

    Nicolai Stange
     

05 May, 2016

2 commits

  • When filesystem is corrupted in the right way, it can happen
    ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
    subsequently remove inode from the in-memory orphan list. However this
    deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
    leave i_orphan list_head with a stale content. Later we can look at this
    content causing list corruption, oops, or other issues. The reported
    trace looked like:

    WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
    list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
    0000000000100100)
    CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
    Stack:
    60462947 62219960 602ede24 62219960
    602ede24 603ca293 622198f0 602f02eb
    62219950 6002c12c 62219900 601b4d6b
    Call Trace:
    [] ? vprintk_emit+0x2dc/0x5c0
    [] ? printk+0x0/0x94
    [] show_stack+0xdc/0x1a0
    [] ? printk+0x0/0x94
    [] ? printk+0x0/0x94
    [] dump_stack+0x2a/0x2c
    [] warn_slowpath_common+0x9c/0xf0
    [] ? __list_del_entry+0x6b/0x100
    [] warn_slowpath_fmt+0x94/0xa0
    [] ? __mutex_lock_slowpath+0x239/0x3a0
    [] ? warn_slowpath_fmt+0x0/0xa0
    [] ? set_signals+0x3f/0x50
    [] ? kmem_cache_free+0x10a/0x180
    [] ? mutex_lock+0x18/0x30
    [] __list_del_entry+0x6b/0x100
    [] ext4_orphan_del+0x22c/0x2f0
    [] ? __ext4_journal_start_sb+0x2c/0xa0
    [] ? ext4_truncate+0x383/0x390
    [] ext4_write_begin+0x30b/0x4b0
    [] ? copy_from_user+0x0/0xb0
    [] ? iov_iter_fault_in_readable+0xa0/0xc0
    [] generic_perform_write+0xaf/0x1e0
    [] ? file_update_time+0x46/0x110
    [] __generic_file_write_iter+0x18f/0x1b0
    [] ext4_file_write_iter+0x15f/0x470
    [] ? unlink_file_vma+0x0/0x70
    [] ? unlink_anon_vmas+0x0/0x260
    [] ? free_pgtables+0xb9/0x100
    [] __vfs_write+0xb0/0x130
    [] vfs_write+0xa5/0x170
    [] SyS_write+0x56/0xe0
    [] ? __libc_waitpid+0x0/0xa0
    [] handle_syscall+0x68/0x90
    [] userspace+0x4fd/0x600
    [] ? save_registers+0x1f/0x40
    [] ? arch_prctl+0x177/0x1b0
    [] fork_handler+0x85/0x90

    Fix the problem by using list_del_init() as we always should with
    i_orphan list.

    CC: stable@vger.kernel.org
    Reported-by: Vegard Nossum
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • A failed call to dqget() returns an ERR_PTR() and not null. Fix
    the check in ext4_ioctl_setproject() to handle this correctly.

    Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
    Cc: stable@vger.kernel.org # v4.5
    Signed-off-by: Seth Forshee
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Seth Forshee
     

30 Apr, 2016

2 commits

  • Instead of just printing warning messages, if the orphan list is
    corrupted, declare the file system is corrupted. If there are any
    reserved inodes in the orphaned inode list, declare the file system
    corrupted and stop right away to avoid doing more potential damage to
    the file system.

    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • If the orphaned inode list contains inode #5, ext4_iget() returns a
    bad inode (since the bootloader inode should never be referenced
    directly). Because of the bad inode, we end up processing the inode
    repeatedly and this hangs the machine.

    This can be reproduced via:

    mke2fs -t ext4 /tmp/foo.img 100
    debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
    mount -o loop /tmp/foo.img /mnt

    (But don't do this if you are using an unpatched kernel if you care
    about the system staying functional. :-)

    This bug was found by the port of American Fuzzy Lop into the kernel
    to find file system problems[1]. (Since it *only* happens if inode #5
    shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
    surprising that AFL needed two hours before it found it.)

    [1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf

    Cc: stable@vger.kernel.org
    Reported by: Vegard Nossum
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o