15 Sep, 2016

3 commits

  • Create a macro to calculate length + offset -> maximum blocks
    This adds more readability.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Theodore Ts'o

    Fabian Frederick
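
As a rough illustration of what such a length + offset macro computes, here is a minimal userspace sketch. The names SKETCH_ALIGN and SKETCH_MAX_BLOCKS are hypothetical, and the kernel's actual definition may differ in detail; the idea is to round the end of the byte range up to a block boundary and subtract the index of the first block.

```c
#include <stdint.h>

/* Round x up to the next multiple of a (a must be a power of two). */
#define SKETCH_ALIGN(x, a) \
	(((uint64_t)(x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

/*
 * Hypothetical stand-in for the macro described above: the maximum
 * number of blocks touched by the byte range [offset, offset + size),
 * for a block size of (1 << blkbits) bytes.
 */
#define SKETCH_MAX_BLOCKS(size, offset, blkbits) \
	((SKETCH_ALIGN((uint64_t)(size) + (offset), 1ULL << (blkbits)) >> (blkbits)) - \
	 ((uint64_t)(offset) >> (blkbits)))
```

With 4 KiB blocks (blkbits = 12), a 2-byte range starting at offset 4095 straddles a block boundary and therefore counts as two blocks.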
     
  • ext4_alloc_file_blocks() is called from ext4_zero_range() and
    ext4_fallocate(), both of which already test EXT4_INODE_EXTENTS,
    so we can call ext_depth(inode) unconditionally.

    [ Added BUG_ON check to make sure ext4_alloc_file_blocks() won't get
    called for an indirect-mapped inode in the future. -- tytso ]

    Signed-off-by: Fabian Frederick
    Signed-off-by: Theodore Ts'o

    Fabian Frederick
     
  • Running xfstests generic/013 with kmemleak gives the following:

    unreferenced object 0xffff8801d3d27de0 (size 96):
    comm "fsstress", pid 4941, jiffies 4294860168 (age 53.485s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    kmemleak_alloc+0x23/0x40
    __kmalloc+0xf5/0x1d0
    ext4_find_extent+0x1ec/0x2f0
    ext4_insert_range+0x34c/0x4a0
    ext4_fallocate+0x4e2/0x8b0
    vfs_fallocate+0x134/0x210
    SyS_fallocate+0x3f/0x60
    entry_SYSCALL_64_fastpath+0x13/0x8f
    0xffffffffffffffff

    The problem seems to be mitigated by dropping refs and freeing the
    path when there's no path[depth].p_ext.

    Cc: stable@vger.kernel.org
    Signed-off-by: Fabian Frederick
    Signed-off-by: Theodore Ts'o

    Fabian Frederick
     

15 Jul, 2016

1 commit

  • Although an extent tree depth of 5 should be enough for the worst
    case of 2**32 extents of length 1, the extent tree code does not
    currently merge nodes which are less than half-full with a sibling
    node, or shrink the tree depth if possible. So it's possible, at
    least in theory, for the tree depth to be greater than 5. However,
    even in the worst case, a tree depth of 32 is highly unlikely, and if
    the file system is maliciously corrupted, an insanely large eh_depth
    can cause memory allocation failures that will trigger kernel warnings
    (here, eh_depth = 65280):

    JBD2: ext4.exe wants too many credits credits:195849 rsv_credits:0 max:256
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 50 at fs/jbd2/transaction.c:293 start_this_handle+0x569/0x580
    CPU: 0 PID: 50 Comm: ext4.exe Not tainted 4.7.0-rc5+ #508
    Stack:
    604a8947 625badd8 0002fd09 00000000
    60078643 00000000 62623910 601bf9bc
    62623970 6002fc84 626239b0 900000125
    Call Trace:
    show_stack+0xdc/0x1a0
    dump_stack+0x2a/0x2e
    __warn+0x114/0x140
    warn_slowpath_null+0x1f/0x30
    start_this_handle+0x569/0x580
    jbd2__journal_start+0x11e/0x220
    __ext4_journal_start_sb+0x60/0xa0
    ext4_truncate+0x131/0x3a0
    ext4_setattr+0x757/0x840
    notify_change+0x16f/0x2a0
    do_truncate+0x76/0xc0
    path_openat+0x806/0x1300
    do_filp_open+0x89/0xf0
    do_sys_open+0x134/0x1e0
    SyS_open+0x20/0x30
    handle_syscall+0x88/0x90
    userspace+0x3fd/0x500
    fork_handler+0x85/0x90

    ---[ end trace 08b0b88b6387a244 ]---

    [ Commit message modified and the extent tree depth check changed
    from 5 to 32 -- tytso ]

    Cc: Darrick J. Wong
    Signed-off-by: Vegard Nossum
    Signed-off-by: Theodore Ts'o

    Vegard Nossum
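
The sanity check this entry describes can be sketched as below. This is a simplified userspace model, not the kernel code: the function name, the constant name, and the choice of -EIO as the error value are all assumptions made for illustration. Only the limit of 32 comes from the note above.

```c
#include <errno.h>
#include <stdint.h>

/* Limit quoted in the maintainer note above; the name is hypothetical. */
#define SKETCH_MAX_SANE_DEPTH 32

/*
 * Reject an on-disk extent header whose depth is implausibly large,
 * before anything sizes an allocation or a journal credit estimate
 * from it. Returns 0 if the depth is acceptable, -EIO otherwise
 * (the error code is an assumption).
 */
static int sketch_check_depth(uint16_t eh_depth)
{
	if (eh_depth > SKETCH_MAX_SANE_DEPTH)
		return -EIO;
	return 0;
}
```

A depth of 65280, as in the warning quoted above, would be rejected up front rather than feeding into a credit request of 195849.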
     

30 Jun, 2016

1 commit

  • An extent with lblock = 4294967295 and len = 1 will pass the
    ext4_valid_extent() test:

    ext4_lblk_t last = lblock + len - 1;

    if (len == 0 || lblock > last)
    return 0;

    since last = 4294967295 + 1 - 1 = 4294967295. This would later trigger
    the BUG_ON(es->es_lblk + es->es_len < es->es_lblk) in ext4_es_end().

    We can simplify it by removing the - 1 altogether and changing the test
    to use lblock + len <= lblock, since now if len = 0, then
    lblock + 0 == lblock and it fails, and if len > 0 then
    lblock + len > lblock in order to pass (i.e. it doesn't overflow).

    Fixes: 5946d0893 ("ext4: check for overlapping extents in ext4_valid_extent_entries()")
    Fixes: 2f974865f ("ext4: check for zero length extent explicitly")
    Cc: Eryu Guan
    Cc: stable@vger.kernel.org
    Signed-off-by: Phil Turnbull
    Signed-off-by: Vegard Nossum
    Signed-off-by: Theodore Ts'o

    Vegard Nossum
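
The wrap-around case is easy to reproduce with 32-bit unsigned arithmetic. The sketch below compares the old test quoted in the message with the requirement that lblock + len be strictly greater than lblock. The function names and the lblk_t typedef are hypothetical stand-ins, and both parameters are modelled as 32-bit unsigned values for simplicity.

```c
#include <stdint.h>

typedef uint32_t lblk_t;   /* stand-in for a 32-bit logical block number */

/* Old test, as quoted in the commit message: misses the wrap-around. */
static int old_range_ok(lblk_t lblock, lblk_t len)
{
	lblk_t last = lblock + len - 1;

	if (len == 0 || lblock > last)
		return 0;
	return 1;
}

/*
 * Simplified test: the range is valid only if lblock + len is strictly
 * greater than lblock. A zero length gives equality and a wrapping sum
 * gives a smaller value, so both are rejected by one comparison.
 */
static int new_range_ok(lblk_t lblock, lblk_t len)
{
	return (lblk_t)(lblock + len) > lblock;
}
```

For lblock = 4294967295 and len = 1 the old test accepts the extent, while the simplified one rejects it because the sum wraps to 0.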
     

06 May, 2016

1 commit


27 Apr, 2016

1 commit


26 Apr, 2016

1 commit

  • The function jbd2_journal_extend() takes as its argument the number of
    new credits to be added to the handle. We weren't taking into account
    the currently unused handle credits; worse, we would try to extend the
    handle by N credits when it had N credits available.

    In the case where jbd2_journal_extend() fails because the transaction
    is too large, when jbd2_journal_restart() gets called, the N credits
    owned by the handle gets returned to the transaction, and the
    transaction commit is asynchronously requested, and then
    start_this_handle() will be able to successfully attach the handle to
    the current transaction since the required credits are now available.

    This is mostly harmless, but since ext4_ext_truncate_extend_restart()
    returns EAGAIN, the truncate machinery will once again try to call
    ext4_ext_truncate_extend_restart(), which will do the above sequence
    over and over again until the transaction has committed.

    This was found while I was debugging a lockup caused by running
    xfstests generic/074 in the data=journal case. I'm still not sure why
    we ended up looping forever, which suggests there may still be another
    bug hiding in the transaction accounting machinery, but this commit
    prevents us from looping in the first place.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
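
The accounting point in this entry, that an extend request should cover only the shortfall, can be modelled in a few lines. The struct and function below are a toy model with hypothetical names, not the jbd2 data structures.

```c
/* Toy model of a journal handle; only the field needed here. */
struct sketch_handle {
	int buffer_credits;   /* credits the handle still has unused */
};

/*
 * How many additional credits to request so that the handle ends up
 * with `needed` credits. Returns 0 when the handle already has enough,
 * in which case no extend call (and no restart) is necessary.
 */
static int sketch_credits_to_add(const struct sketch_handle *h, int needed)
{
	if (h->buffer_credits >= needed)
		return 0;
	return needed - h->buffer_credits;
}
```

In this model a handle holding N unused credits that needs N asks for nothing, which avoids the extend, fail, restart cycle described above.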
     

10 Mar, 2016

3 commits

  • Signed-off-by: Adam Buchbinder
    Signed-off-by: Theodore Ts'o

    Adam Buchbinder
     
  • Currently, ext4_map_blocks() just returns 0 when it finds a hole and
    allocation is not requested. However we have all the information
    available to tell how large the hole actually is and there are callers
    of ext4_map_blocks() which would save some block-by-block hole iteration
    if they knew this information. So fill in struct ext4_map_blocks even
    for holes with the information we have. We keep returning 0 for holes to
    maintain backward compatibility of the function.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
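
To see why reporting the hole length helps callers, consider the toy lookup below. It is a userspace model over a sorted array of extents; every name in it is hypothetical and it does not reproduce the real ext4_map_blocks() interface. The lookup still returns 0 for a hole, but it also fills in how long the hole is, so a caller can jump over it in one step.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_extent { uint32_t start, len; };   /* one mapped range */
struct sketch_map { uint32_t m_lblk, m_len; };   /* in: request, out: result */

/*
 * Returns the number of mapped blocks at m_lblk, or 0 for a hole. In
 * both cases m_len is set to the length of the mapped run or hole that
 * starts at m_lblk, capped to the requested length.
 */
static int sketch_map_blocks(const struct sketch_extent *ex, size_t n,
			     struct sketch_map *map)
{
	uint32_t want = map->m_len;
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t end = ex[i].start + ex[i].len;   /* exclusive */

		if (map->m_lblk < ex[i].start) {          /* hole before extent i */
			uint32_t hole = ex[i].start - map->m_lblk;

			map->m_len = hole < want ? hole : want;
			return 0;
		}
		if (map->m_lblk < end) {                  /* inside extent i */
			uint32_t left = end - map->m_lblk;

			map->m_len = left < want ? left : want;
			return (int)map->m_len;
		}
	}
	return 0;   /* hole after the last extent: m_len keeps the request */
}

/* Count the lookups needed to walk blocks [0, total). */
static unsigned sketch_walk(const struct sketch_extent *ex, size_t n,
			    uint32_t total)
{
	unsigned calls = 0;
	uint32_t lblk = 0;

	while (lblk < total) {
		struct sketch_map map = { lblk, total - lblk };

		sketch_map_blocks(ex, n, &map);
		calls++;
		lblk += map.m_len;   /* skip the whole run or hole at once */
	}
	return calls;
}
```

For a 200-block file with a single 10-block extent at block 100, the walk needs three lookups instead of one per block.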
     
  • ext4_ext_put_gap_in_cache() determines hole size in the extent tree,
    then trims this with possible delayed allocated blocks, and inserts the
    result into the extent status tree. Factor out determination of the size
    of the hole in the extent tree as we will need this information in
    ext4_ext_map_blocks() as well.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

09 Mar, 2016

1 commit

  • When mapping blocks for direct IO, we allocate io_end structure before
    mapping blocks and store pointer to it in the inode. This creates a
    requirement that any AIO DIO using io_end must be protected by i_mutex.
    This created problems in the past with dioread_nolock mode which was
    corrupting io_end pointers. Also io_end is allocated unnecessarily in
    case where we don't need to convert any extents (which is a common case
    for example when overwriting file).

    We fix the problem by allocating io_end only once we return unwritten
    extent from block mapping function for AIO DIO (so we can save some
    pointless io_end allocations) and we pass pointer to it in bh->b_private
    which generic DIO code later passes to our end IO callback. That way we
    remove any need for global pointer to io_end structure and thus fix the
    races.

    The downside of this change is that the checking for unwritten IO in
    flight in ext4_extents_can_be_merged() is more racy since we now
    increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
    i_data_sem. However the check has been racy already before because
    ext4_writepages() already increments i_unwritten after dropping
    i_data_sem and reserved blocks save us from hitting ENOSPC in the worst
    case.

    Signed-off-by: Jan Kara

    Jan Kara
     

23 Feb, 2016

1 commit


12 Feb, 2016

1 commit


23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

08 Dec, 2015

6 commits

  • DAX page fault path needs to get blocks that are pre-zeroed to avoid
    races when two concurrent page faults happen in the same block of a
    file. Implement support for this in ext4_map_blocks().

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Create new function ext4_issue_zeroout() to zeroout contiguous (both
    logically and physically) part of inode data. We will need to issue
    zeroout when extent structure is not readily available and this function
    will allow us to do it without making up fake extent structures.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • When doing delayed allocation, update of on-disk inode size is postponed
    until IO submission time. However hole punch or zero range fallocate
    calls can end up discarding the tail page cache page and thus on-disk
    inode size would never be properly updated.

    Make sure the on-disk inode size is updated before truncating page
    cache.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Current code implementing FALLOC_FL_COLLAPSE_RANGE and
    FALLOC_FL_INSERT_RANGE is prone to races with buffered writes and page
    faults. If buffered write or write via mmap manages to squeeze between
    filemap_write_and_wait_range() and truncate_pagecache() in the fallocate
    implementations, the written data is simply discarded by
    truncate_pagecache() although it should have been shifted.

    Fix the problem by moving filemap_write_and_wait_range() call inside
    i_mutex and i_mmap_sem. That way we are protected against races with
    both buffered writes and page faults.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently ext4_alloc_file_blocks() was handling protection against
    unlocked DIO. However we now need to sometimes call it under i_mmap_sem
    and sometimes not and DIO protection ranks above it (although strictly
    speaking this cannot currently create any deadlocks). Also
    ext4_zero_range() was actually getting & releasing unlocked DIO
    protection twice in some cases. Luckily it didn't introduce any real bug
    but it was a land mine waiting to be stepped on. So move DIO protection
    out from ext4_alloc_file_blocks() into the two callsites.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently, page faults and hole punching are completely unsynchronized.
    This can result in page fault faulting in a page into a range that we
    are punching after truncate_pagecache_range() has been called and thus
    we can end up with a page mapped to disk blocks that will be shortly
    freed. Filesystem corruption will shortly follow. Note that the same
    race is avoided for truncate by checking page fault offset against
    i_size but there isn't similar mechanism available for punching holes.

    Fix the problem by creating new rw semaphore i_mmap_sem in inode and
    grab it for writing over truncate, hole punching, and other functions
    removing blocks from extent tree and for read over page faults. We
    cannot easily use i_data_sem for this since that ranks below transaction
    start and we need something ranking above it so that it can be held over
    the whole truncate / hole punching operation. Also remove various
    workarounds we had in the code to reduce race window when page fault
    could have created pages with stale mapping information.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

08 Nov, 2015

1 commit

  • Pull trivial updates from Jiri Kosina:
    "Trivial stuff from trivial tree that can be trivially summed up as:

    - treewide drop of spurious unlikely() before IS_ERR() from Viresh
    Kumar

    - cosmetic fixes (that don't really affect basic functionality of the
    driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek

    - various comment / printk fixes and updates all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    bcache: Really show state of work pending bit
    hwmon: applesmc: fix comment typos
    Kconfig: remove comment about scsi_wait_scan module
    class_find_device: fix reference to argument "match"
    debugfs: document that debugfs_remove*() accepts NULL and error values
    net: Drop unlikely before IS_ERR(_OR_NULL)
    mm: Drop unlikely before IS_ERR(_OR_NULL)
    fs: Drop unlikely before IS_ERR(_OR_NULL)
    drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
    drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
    UBI: Update comments to reflect UBI_METAONLY flag
    pktcdvd: drop null test before destroy functions

    Linus Torvalds
     

18 Oct, 2015

2 commits


03 Oct, 2015

1 commit

  • Fix multiple bugs in ext4_encrypted_zeroout(), including one that
    could cause us to write an encrypted zero page to the wrong location
    on disk, potentially causing data and file system corruption.
    Fortunately, this tends to only show up in stress tests, but even with
    these fixes, we are seeing some test failures with generic/127 --- but
    these are now caused by data failures instead of metadata corruption.

    Since ext4_encrypted_zeroout() is only used for some optimizations to
    keep the extent tree from being too fragmented, and
    ext4_encrypted_zeroout() itself isn't all that optimized from a time
    or IOPS perspective, disable the extent tree optimization for
    encrypted inodes for now. This prevents the data corruption issues
    reported by generic/127 until we can figure out what's going wrong.

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

29 Sep, 2015

1 commit

  • IS_ERR(_OR_NULL) already contains an 'unlikely' branch hint, so there
    is no need to add one again in its callers. Drop it.

    Signed-off-by: Viresh Kumar
    Reviewed-by: Jeff Layton
    Reviewed-by: David Howells
    Reviewed-by: Steve French
    Signed-off-by: Jiri Kosina

    Viresh Kumar
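
The pattern behind this cleanup can be shown with a small userspace model. The sketch_ names are hypothetical and the definitions are simplified compared with the kernel's err.h; __builtin_expect is the GCC/Clang builtin that such hints are usually built on.

```c
#define SKETCH_MAX_ERRNO 4095
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* The branch hint lives inside the helper ... */
static inline int sketch_is_err(const void *ptr)
{
	return sketch_unlikely((unsigned long)ptr >=
			       (unsigned long)-SKETCH_MAX_ERRNO);
}

/* Encode a negative errno value as a pointer, as error pointers do. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)err;
}

/* ... so callers test the result directly, without a second hint. */
static int sketch_caller(const void *p)
{
	if (sketch_is_err(p))   /* not: sketch_unlikely(sketch_is_err(p)) */
		return -1;
	return 0;
}
```

Wrapping the call in another hint at the call site would add nothing, which is why the callers can drop it.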
     

06 Jul, 2015

1 commit

  • Pull ext4 bugfixes from Ted Ts'o:
    "Bug fixes (all for stable kernels) for ext4:

    - address corner cases for indirect blocks->extent migration

    - fix reserved block accounting invalidate_page when
    page_size != block_size (i.e., ppc or 1k block size file systems)

    - fix deadlocks when a memcg is under heavy memory pressure

    - fix fencepost error in lazytime optimization"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: replace open coded nofail allocation in ext4_free_blocks()
    ext4: correctly migrate a file with a hole at the beginning
    ext4: be more strict when migrating to non-extent based file
    ext4: fix reservation release on invalidatepage for delalloc fs
    ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
    bufferhead: Add _gfp version for sb_getblk()
    ext4: fix fencepost error in lazytime optimization

    Linus Torvalds
     

02 Jul, 2015

1 commit


26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

21 Jun, 2015

1 commit

  • In order to prevent quota block tracking from becoming inaccurate when
    ext4_quota_write() fails with ENOSPC, we make two changes. The quota
    file can now use the reserved block (since the quota file is arguably
    file system metadata), and ext4_quota_write() now uses
    ext4_should_retry_alloc() to retry the block allocation after a commit
    has completed and released some blocks for allocation.

    This fixes failures of xfstests generic/270:

    Quota error (device vdc): write_blk: dquota write failed
    Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

15 Jun, 2015

2 commits

  • Currently existing dio workers can jump in and potentially increase
    extent tree depth while we're allocating blocks in
    ext4_alloc_file_blocks(). This may cause us to underestimate the
    number of credits needed for the transaction because the extent tree
    depth can change after our estimation.

    Fix this by waiting for all the existing dio workers in the same way
    as we do it in ext4_punch_hole. We've seen errors caused by this in
    xfstest generic/299, however it's really hard to reproduce.

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o

    Lukas Czerner
     
  • Currently in ext4_alloc_file_blocks() the number of credits is
    calculated only once before we enter the allocation loop. However within
    the allocation loop the extent tree depth can change, hence the number
    of credits needed can increase potentially exceeding the number of credits
    reserved in the handle which can cause journal failures.

    Fix this by recalculating the number of credits when the inode depth
    changes. Note that even though ext4_alloc_file_blocks() is currently
    only used by extent-based inodes, we will avoid recalculating the
    number of credits unnecessarily in the case of indirect-based inodes.

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o

    Lukas Czerner
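
The idea of re-estimating credits only when the tree depth changes can be sketched as a toy loop. The cost formula and all names below are assumptions for illustration; the real estimate depends on much more than depth.

```c
/* Hypothetical cost model: the credit estimate grows with tree depth. */
static int sketch_credits_for_depth(int depth)
{
	return 4 * (depth + 1);
}

/*
 * Toy allocation loop. depth_at[i] is the extent tree depth observed
 * before step i. The estimate is recomputed only when the depth differs
 * from the one it was last computed for. Returns the number of
 * recomputations and stores the last estimate in *final_credits.
 */
static int sketch_alloc_loop(const int *depth_at, int steps,
			     int *final_credits)
{
	int depth = -1;   /* -1 forces the first estimate */
	int credits = 0, recalcs = 0, i;

	for (i = 0; i < steps; i++) {
		if (depth_at[i] != depth) {
			depth = depth_at[i];
			credits = sketch_credits_for_depth(depth);
			recalcs++;
		}
		/* ... start a handle with `credits`, allocate one chunk ... */
	}
	*final_credits = credits;
	return recalcs;
}
```

If the depth grows from 1 to 2 partway through five steps, the estimate is computed twice: once up front and once after the change.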
     

09 Jun, 2015

1 commit

  • This patch implements fallocate's FALLOC_FL_INSERT_RANGE for Ext4.

    1) Make sure that both offset and len are block size aligned.
    2) Update the i_size of inode by len bytes.
    3) Compute the file's logical block number against offset. If the computed
    block number is not the starting block of the extent, split the extent
    such that the block number is the starting block of the extent.
    4) Shift all the extents which are lying between [offset, last allocated extent]
    towards right by len bytes. This step will make a hole of len bytes
    at offset.

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan

    Namjae Jeon
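
Steps 3 and 4 above can be illustrated on a plain array of extents. This is a userspace toy model with hypothetical names, working in units of blocks, so the alignment check of step 1 and the i_size update of step 2 are assumed to have happened already. The array must have room for one extra entry in case a split is needed.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_ext { uint32_t lblk, len; };   /* logical start, length (blocks) */

/*
 * Insert a hole of `count` blocks at logical block `at` in an extent
 * array sorted by lblk. Returns the new number of extents.
 */
static size_t sketch_insert_range(struct sketch_ext *ex, size_t n,
				  uint32_t at, uint32_t count)
{
	size_t i, j;

	/* Step 3: split the extent that straddles `at`, if there is one. */
	for (i = 0; i < n; i++) {
		if (ex[i].lblk < at && at < ex[i].lblk + ex[i].len) {
			for (j = n; j > i + 1; j--)   /* open a slot after i */
				ex[j] = ex[j - 1];
			ex[i + 1].lblk = at;
			ex[i + 1].len = ex[i].lblk + ex[i].len - at;
			ex[i].len = at - ex[i].lblk;
			n++;
			break;
		}
	}

	/* Step 4: shift every extent starting at or beyond `at` right. */
	for (i = n; i-- > 0; )
		if (ex[i].lblk >= at)
			ex[i].lblk += count;

	return n;
}
```

Inserting 3 blocks at block 4 into extents {0, 10} and {20, 5} yields {0, 4}, {7, 6} and {23, 5}, leaving a hole at blocks 4 to 6.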
     

08 Jun, 2015

1 commit

  • During a source code review of fs/ext4/extents.c I noted identical
    consecutive lines. An assertion is repeated for inode1 and never done
    for inode2. This is not in keeping with the rest of the code in the
    ext4_swap_extents function and appears to be a bug.

    Assert that the inode2 mutex is not locked.

    Signed-off-by: David Moore
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Eric Sandeen

    David Moore
     

02 Jun, 2015

1 commit

  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain making it a lot easier to use it across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 May, 2015

2 commits

  • The xfstests test suite assumes that an attempt to collapse range on
    the range (0, 1) will return EOPNOTSUPP if the file system does not
    support collapse range. Commit 280227a75b56: "ext4: move check under
    lock scope to close a race" broke this, which caused xfstests to fail
    when testing file systems that did not have the extents feature
    enabled.

    Reported-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • The following commit introduced a bug when checking for zero-length
    extents:

    5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries()

    A zero-length extent could pass the check if lblock is zero.

    Add the explicit check for zero length back.

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Eryu Guan
     

03 May, 2015

1 commit


12 Apr, 2015

1 commit


03 Apr, 2015

1 commit

  • When xfstests' auto group is run on a bigalloc filesystem with a
    4.0-rc3 kernel, e2fsck failures and kernel warnings occur for some
    tests. e2fsck reports incorrect iblocks values, and the warnings
    indicate that the space reserved for delayed allocation is being
    overdrawn at allocation time.

    Some of these errors occur because the reserved space is incorrectly
    decreased by one cluster when ext4_ext_map_blocks satisfies an
    allocation request by mapping an unused portion of a previously
    allocated cluster. Because a cluster's worth of reserved space was
    already released when it was first allocated, it should not be released
    again.

    This patch appears to correct the e2fsck failure reported for
    generic/232 and the kernel warnings produced by ext4/001, generic/009,
    and generic/033. Failures and warnings for some other tests remain to
    be addressed.

    Signed-off-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Eric Whitney