10 Jul, 2015

1 commit

  • The FITRIM ioctl has the same arguments on 32-bit and 64-bit
    architectures, so we can add it to the list of compatible ioctls and
    drop it from the compat_ioctl methods of various filesystems.
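
    As a rough illustration of why no argument translation is needed: assuming
    FITRIM takes a struct fstrim_range made of three __u64 fields (start, len,
    minlen), as in the kernel UAPI headers rather than anything stated in this
    commit message, the layout holds only fixed-width members. The Python
    sketch below mirrors that structure with ctypes and reports its size and
    field offsets; the class and helper names are made up for the example.

```python
import ctypes


class FstrimRange(ctypes.Structure):
    """Userspace mirror of struct fstrim_range (assumed: three __u64 fields)."""
    _fields_ = [
        ("start", ctypes.c_uint64),
        ("len", ctypes.c_uint64),
        ("minlen", ctypes.c_uint64),
    ]


def layout(struct_type):
    """Return (total size, {field name: byte offset}) for a ctypes structure."""
    offsets = {
        name: getattr(struct_type, name).offset
        for name, _ctype in struct_type._fields_
    }
    return ctypes.sizeof(struct_type), offsets


size, offsets = layout(FstrimRange)
print(size, offsets)
```

    Because every member is a 64-bit integer, there are no pointers or longs
    whose width would differ between a 32-bit caller and a 64-bit kernel.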

    Signed-off-by: Mikulas Patocka
    Cc: Al Viro
    Cc: Ted Ts'o
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

06 Jul, 2015

2 commits

  • Pull ext4 bugfixes from Ted Ts'o:
    "Bug fixes (all for stable kernels) for ext4:

    - address corner cases for indirect blocks->extent migration

    - fix reserved block accounting in invalidatepage when
    page_size != block_size (i.e., ppc or 1k block size file systems)

    - fix deadlocks when a memcg is under heavy memory pressure

    - fix fencepost error in lazytime optimization"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: replace open coded nofail allocation in ext4_free_blocks()
    ext4: correctly migrate a file with a hole at the beginning
    ext4: be more strict when migrating to non-extent based file
    ext4: fix reservation release on invalidatepage for delalloc fs
    ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
    bufferhead: Add _gfp version for sb_getblk()
    ext4: fix fencepost error in lazytime optimization

    Linus Torvalds
     
  • ext4_free_blocks is looping around the allocation request and mimics
    __GFP_NOFAIL behavior without any allocation fallback strategy. Let's
    remove the open coded loop and replace it with __GFP_NOFAIL. Without the
    flag the allocator has no way to find out about the never-fail requirement
    and cannot help in any way.

    Signed-off-by: Michal Hocko
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Michal Hocko
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

04 Jul, 2015

3 commits

  • Currently ext4_ind_migrate() doesn't correctly handle a file which
    contains a hole at the beginning of the file. This caused the migration
    to be done incorrectly, and a subsequent delayed allocation write to
    the "hole" would then reclaim the same data blocks again, resulting in
    fs corruption.

    # assuming 4k block size ext4, with delalloc enabled
    # skip the first block and write to the second block
    xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile

    # converting to indirect-mapped file, which would move the data blocks
    # to the beginning of the file, but extent status cache still marks
    # that region as a hole
    chattr -e /mnt/ext4/testfile

    # delayed allocation writes to the "hole", reclaim the same data block
    # again, results in i_blocks corruption
    xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
    umount /mnt/ext4
    e2fsck -nf /dev/sda6
    ...
    Inode 53, i_blocks is 16, should be 8. Fix? no
    ...

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Eryu Guan
     
  • Currently the check in ext4_ind_migrate() is not enough before doing the
    real conversion:

    a) delayed allocated extents could bypass the check on eh->eh_entries
    and eh->eh_depth

    This can be demonstrated by this script

    xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
    chattr -e /mnt/ext4/testfile

    where testfile has two extents but is still converted to the non-extent
    based file format.

    b) only the extent length is checked but not the offset, which would
    result in data loss (delalloc) or fs corruption (nodelalloc), because a
    non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
    blocks

    This can be demonstrated by

    xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
    chattr -e /mnt/ext4/testfile
    sync

    If delalloc is enabled, dmesg prints
    EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
    EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
    EXT4-fs (dm-4): This should not happen!! Data will be lost

    If delalloc is disabled, e2fsck -nf shows corruption
    Inode 53, i_size is 5497558142976, should be 4096. Fix? no

    Fix the two issues by

    a) forcing all delayed allocation blocks to be allocated before checking
    eh->eh_depth and eh->eh_entries
    b) requiring the last logical block of the extent to be within the direct map
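
    The limit quoted in (b) can be checked with a little arithmetic. The
    sketch below assumes 4-byte block pointers and a 4k block size, so each
    indirect block holds 2^10 pointers; the function names are illustrative
    and do not correspond to kernel helpers.

```python
TIB = 1 << 40


def max_indirect_blocks(block_size=4096):
    """Logical blocks reachable via 12 direct pointers plus single, double
    and triple indirect blocks, assuming 4-byte block pointers."""
    ptrs = block_size // 4
    return 12 + ptrs + ptrs ** 2 + ptrs ** 3


def fits_indirect_map(byte_offset, length, block_size=4096):
    """True if the last block of [byte_offset, byte_offset + length) can be
    addressed by a non-extent (indirect-mapped) file."""
    last_block = (byte_offset + length - 1) // block_size
    return last_block < max_indirect_blocks(block_size)


# The reproducer writes 4k at offset 5T.
logical_block = (5 * TIB) // 4096
print(logical_block, max_indirect_blocks(), fits_indirect_map(5 * TIB, 4096))
```

    With these assumptions the write at 5T lands on logical block 1342177280,
    the same block number reported in the dmesg warning, which is beyond what
    the indirect scheme can address.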

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Eryu Guan
     
  • On a delalloc-enabled file system, during the invalidatepage operation
    in ext4_da_page_release_reservation() we want to clear the delayed
    buffer and remove the extent covering the delayed buffer from the extent
    status tree.

    However, there is currently a bug where on systems with page size >
    block size we will always remove extents from the start of the page
    regardless of where the actual delayed buffers are positioned in the page.
    This leads to errors like this:

    EXT4-fs warning (device loop0): ext4_da_release_space:1225:
    ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
    blocks

    This however can cause data loss at writeback time if the file system is
    in an ENOSPC condition, because we're releasing the reservation for
    someone else's delayed buffer.

    Fix this by only removing extents that correspond to the part of the
    page we want to invalidate.
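
    A simplified model of the corrected behaviour, not the kernel code: walk
    the buffers of one page, keep only delayed buffers that lie fully inside
    the invalidated byte range, and report each contiguous run as a (first
    logical block, count) pair. The function name, the boolean list standing
    in for buffer heads and the default sizes are assumptions made for the
    sketch.

```python
def extents_to_remove(page_index, delayed, offset, length,
                      page_size=4096, block_size=1024):
    """Return (first_logical_block, count) runs of delayed buffers that lie
    fully inside the byte range [offset, offset + length) of one page.

    `delayed` holds one boolean per buffer (block) in the page.
    """
    blocks_per_page = page_size // block_size
    if len(delayed) != blocks_per_page:
        raise ValueError("need one flag per block in the page")
    first_block_of_page = page_index * blocks_per_page
    stop = offset + length

    runs = []
    run_start, run_len = None, 0
    for i, is_delayed in enumerate(delayed):
        start, end = i * block_size, (i + 1) * block_size
        inside = start >= offset and end <= stop
        if inside and is_delayed:
            if run_len == 0:
                run_start = first_block_of_page + i
            run_len += 1
        elif run_len:
            # The run ended: record it and start looking for the next one.
            runs.append((run_start, run_len))
            run_len = 0
    if run_len:
        runs.append((run_start, run_len))
    return runs


# Invalidate only the second half of page 3 (1k blocks, 4k page).
print(extents_to_remove(3, [True, True, False, True], 2048, 2048))
```

    In this model the delayed buffers in the first half of the page are left
    alone, whereas starting from the beginning of the page would have released
    blocks 12 and 13, which belong to buffers outside the invalidated range.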

    This problem is reproducible with the following fio recipe (however, I was
    only able to reproduce it with fio-2.1 or older):

    [global]
    bs=8k
    iodepth=1024
    iodepth_batch=60
    randrepeat=1
    size=1m
    directory=/mnt/test
    numjobs=20
    [job1]
    ioengine=sync
    bs=1k
    direct=1
    rw=randread
    filename=file1:file2
    [job2]
    ioengine=libaio
    rw=randwrite
    direct=1
    filename=file1:file2
    [job3]
    bs=1k
    ioengine=posixaio
    rw=randwrite
    direct=1
    filename=file1:file2
    [job5]
    bs=1k
    ioengine=sync
    rw=randread
    filename=file1:file2
    [job7]
    ioengine=libaio
    rw=randwrite
    filename=file1:file2
    [job8]
    ioengine=posixaio
    rw=randwrite
    filename=file1:file2
    [job10]
    ioengine=mmap
    rw=randwrite
    bs=1k
    filename=file1:file2
    [job11]
    ioengine=mmap
    rw=randwrite
    direct=1
    filename=file1:file2

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@vger.kernel.org

    Lukas Czerner
     

02 Jul, 2015

2 commits


01 Jul, 2015

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's a couple of small API changes to the core DAX code which
    required small changes to the ext2 and ext4 code bases, but otherwise
    everything is within the XFS codebase.

    This update contains:

    - A new sparse on-disk inode record format to allow small extents to
    be used for inode allocation when free space is fragmented.

    - DAX support. This includes minor changes to the DAX core code to
    fix problems with lock ordering and bufferhead mapping abuse.

    - transaction commit interface cleanup

    - removal of various unnecessary XFS specific type definitions

    - cleanup and optimisation of freelist preparation before allocation

    - various minor cleanups

    - bug fixes for
    - transaction reservation leaks
    - incorrect inode logging in unwritten extent conversion
    - mmap lock vs freeze ordering
    - remote symlink mishandling
    - attribute fork removal issues"

    * tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
    xfs: don't truncate attribute extents if no extents exist
    xfs: clean up XFS_MIN_FREELIST macros
    xfs: sanitise error handling in xfs_alloc_fix_freelist
    xfs: factor out free space extent length check
    xfs: xfs_alloc_fix_freelist() can use incore perag structures
    xfs: remove xfs_caddr_t
    xfs: use void pointers in log validation helpers
    xfs: return a void pointer from xfs_buf_offset
    xfs: remove inst_t
    xfs: remove __psint_t and __psunsigned_t
    xfs: fix remote symlinks on V5/CRC filesystems
    xfs: fix xfs_log_done interface
    xfs: saner xfs_trans_commit interface
    xfs: remove the flags argument to xfs_trans_cancel
    xfs: pass a boolean flag to xfs_trans_free_items
    xfs: switch remaining xfs_trans_dup users to xfs_trans_roll
    xfs: check min blks for random debug mode sparse allocations
    xfs: fix sparse inodes 32-bit compile failure
    xfs: add initial DAX support
    xfs: add DAX IO path support
    ...

    Linus Torvalds
     

27 Jun, 2015

1 commit

  • Merge second patchbomb from Andrew Morton:

    - most of the rest of MM

    - lots of misc things

    - procfs updates

    - printk feature work

    - updates to get_maintainer, MAINTAINERS, checkpatch

    - lib/ updates

    * emailed patches from Andrew Morton: (96 commits)
    exit,stats: /* obey this comment */
    coredump: add __printf attribute to cn_*printf functions
    coredump: use from_kuid/kgid when formatting corename
    fs/reiserfs: remove unneeded cast
    NILFS2: support NFSv2 export
    fs/befs/btree.c: remove unneeded initializations
    fs/minix: remove unneeded cast
    init/do_mounts.c: add create_dev() failure log
    kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
    fs/efs: femove unneeded cast
    checkpatch: emit "NOTE: " message only once after multiple files
    checkpatch: emit an error when there's a diff in a changelog
    checkpatch: validate MODULE_LICENSE content
    checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
    checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
    checkpatch: fix processing of MEMSET issues
    checkpatch: suggest using ether_addr_equal*()
    checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
    checkpatch: remove local from codespell path
    checkpatch: add --showfile to allow input via pipe to show filenames
    ...

    Linus Torvalds
     

26 Jun, 2015

4 commits

  • This makes a very large function a little smaller.

    Signed-off-by: Rasmus Villemoes
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     
  • Pull core block IO update from Jens Axboe:
    "Nothing really major in here, mostly a collection of smaller
    optimizations and cleanups, mixed with various fixes. In more detail,
    this contains:

    - Addition of policy specific data to blkcg for block cgroups. From
    Arianna Avanzini.

    - Various cleanups around command types from Christoph.

    - Cleanup of the suspend block I/O path from Christoph.

    - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

    - Eliminating atomic inc/dec of both remaining IO count and reference
    count in a bio. From me.

    - Fixes for SG gap and chunk size support for data-less (discards)
    IO, so we can merge these better. From me.

    - Small restructuring of blk-mq shared tag support, freeing drivers
    from iterating hardware queues. From Keith Busch.

    - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
    IOPS mode the default for non-rotational storage"

    * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
    cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
    cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
    cfq-iosched: move group scheduling functions under ifdef
    cfq-iosched: fix the setting of IOPS mode on SSDs
    blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
    block, cgroup: implement policy-specific per-blkcg data
    block: Make CFQ default to IOPS mode on SSDs
    block: add blk_set_queue_dying() to blkdev.h
    blk-mq: Shared tag enhancements
    block: don't honor chunk sizes for data-less IO
    block: only honor SG gap prevention for merges that contain data
    block: fix returnvar.cocci warnings
    block, dm: don't copy bios for request clones
    block: remove management of bi_remaining when restoring original bi_end_io
    block: replace trylock with mutex_lock in blkdev_reread_part()
    block: export blkdev_reread_part() and __blkdev_reread_part()
    suspend: simplify block I/O handling
    block: collapse bio bit space
    block: remove unused BIO_RW_BLOCK and BIO_EOF flags
    block: remove BIO_EOPNOTSUPP
    ...

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "A very large number of cleanups and bug fixes --- in particular for
    the ext4 encryption patches, which is a new feature added in the last
    merge window. Also fix a number of long-standing xfstest failures.
    (Quota writes failing due to ENOSPC, a race between truncate and
    writepage in data=journalled mode that was causing generic/068 to
    fail, and other corner cases.)

    Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2
    performance eliminating locking when a buffer is modified more than
    once during a transaction (which is very common for allocation
    bitmaps, for example), in which case the state of the journalled
    buffer head doesn't need to change"

    [ I renamed "ext4_follow_link()" to "ext4_encrypted_follow_link()" in
    the merge resolution, to make it clear that that function is _only_
    used for encrypted symlinks. The function doesn't actually work for
    non-encrypted symlinks at all, and they use the generic helpers
    - Linus ]

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
    ext4: set lazytime on remount if MS_LAZYTIME is set by mount
    ext4: only call ext4_truncate when size <= isize
    ext4: don't retry file block mapping on bigalloc fs with non-extent file
    ext4: prevent ext4_quota_write() from failing due to ENOSPC
    ext4: call sync_blockdev() before invalidate_bdev() in put_super()
    jbd2: speedup jbd2_journal_dirty_metadata()
    jbd2: get rid of open coded allocation retry loop
    ext4: improve warning directory handling messages
    jbd2: fix ocfs2 corrupt when updating journal superblock fails
    ext4: mballoc: avoid 20-argument function call
    ext4: wait for existing dio workers in ext4_alloc_file_blocks()
    ext4: recalculate journal credits as inode depth changes
    jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()
    ext4: use swap() in mext_page_double_lock()
    ext4: use swap() in memswap()
    ext4: fix race between truncate and __ext4_journalled_writepage()
    ext4 crypto: fail the mount if blocksize != pagesize
    ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate
    ...

    Linus Torvalds
     

24 Jun, 2015

1 commit


23 Jun, 2015

2 commits

  • Newer versions of mount parse the lazytime feature and pass it to the
    mount system call via its flags field, removing the lazytime string
    from the mount options list. So we need
    to check for the presence of MS_LAZYTIME and set it in sb->s_flags in
    order for this flag to be set on a remount.
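
    The check described above amounts to carrying one bit from the mount(2)
    flags word into the superblock flags. The sketch below models that with
    plain integers; MS_LAZYTIME is taken to be bit 25 as in the Linux UAPI
    headers, and remount_flags() is a made-up name, not the ext4 function.

```python
MS_RDONLY = 1            # read-only mount flag, used here only for the demo
MS_LAZYTIME = 1 << 25    # value from the Linux UAPI headers


def remount_flags(s_flags, mount_flags):
    """Carry MS_LAZYTIME from the mount(2) flags word into the superblock
    flags, leaving every other superblock bit untouched."""
    if mount_flags & MS_LAZYTIME:
        s_flags |= MS_LAZYTIME
    return s_flags


print(hex(remount_flags(MS_RDONLY, MS_LAZYTIME)))
```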

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • Pull vfs updates from Al Viro:
    "In this pile: pathname resolution rewrite.

    - recursion in link_path_walk() is gone.

    - nesting limits on symlinks are gone (the only limit remaining is
    that the total amount of symlinks is no more than 40, no matter how
    nested).

    - "fast" (inline) symlinks are handled without leaving rcuwalk mode.

    - stack footprint (independent of the nesting) is below kilobyte now,
    about on par with what it used to be with one level of nested
    symlinks and ~2.8 times lower than it used to be in the worst case.

    - struct nameidata is entirely private to fs/namei.c now (not even
    opaque pointers are being passed around).

    - ->follow_link() and ->put_link() calling conventions had been
    changed; all in-tree filesystems converted, out-of-tree should be
    able to follow reasonably easily.

    For out-of-tree conversions, see Documentation/filesystems/porting
    for details (and in-tree filesystems for examples of conversion).

    That has sat in -next since mid-May, seems to survive all testing
    without regressions and merges clean with v4.1"
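
    To make the first two points concrete, here is a toy model of
    recursion-free resolution with a single cap on the total number of
    symlinks followed. It is a userspace sketch over a dictionary, not
    fs/namei.c: the `symlinks` mapping, the function name and the error text
    are assumptions, and only the figure of 40 comes from the message above.

```python
MAX_TOTAL_SYMLINKS = 40  # total cap quoted in the pull message


def resolve(path, symlinks):
    """Resolve an absolute path against `symlinks`, a hypothetical dict that
    maps an absolute path to its link target. Anything not in the dict is
    treated as a plain file or directory. Components still to be walked are
    kept on an explicit work list instead of recursing."""
    pending = [c for c in path.split("/") if c][::-1]  # next component last
    resolved = []   # components of the path walked so far
    followed = 0    # total symlinks followed, regardless of nesting
    while pending:
        comp = pending.pop()
        if comp == ".":
            continue
        if comp == "..":
            if resolved:
                resolved.pop()
            continue
        candidate = "/" + "/".join(resolved + [comp])
        target = symlinks.get(candidate)
        if target is None:
            resolved.append(comp)
            continue
        followed += 1
        if followed > MAX_TOTAL_SYMLINKS:
            raise OSError("ELOOP: too many symbolic links")
        if target.startswith("/"):
            resolved = []  # an absolute target restarts from the root
        # Queue the target's components ahead of whatever is still pending.
        pending.extend(c for c in target.split("/")[::-1] if c)
    return "/" + "/".join(resolved)


links = {"/a": "/b", "/b": "c", "/c": "/real"}
print(resolve("/a/x", links))
```

    Because the counter is global to the walk, a chain of links and a deeply
    nested set of links are limited in exactly the same way.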

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (131 commits)
    turn user_{path_at,path,lpath,path_dir}() into static inlines
    namei: move saved_nd pointer into struct nameidata
    inline user_path_create()
    inline user_path_parent()
    namei: trim do_last() arguments
    namei: stash dfd and name into nameidata
    namei: fold path_cleanup() into terminate_walk()
    namei: saner calling conventions for filename_parentat()
    namei: saner calling conventions for filename_create()
    namei: shift nameidata down into filename_parentat()
    namei: make filename_lookup() reject ERR_PTR() passed as name
    namei: shift nameidata inside filename_lookup()
    namei: move putname() call into filename_lookup()
    namei: pass the struct path to store the result down into path_lookupat()
    namei: uninline set_root{,_rcu}()
    namei: be careful with mountpoint crossings in follow_dotdot_rcu()
    Documentation: remove outdated information from automount-support.txt
    get rid of assorted nameidata-related debris
    lustre: kill unused helper
    lustre: kill unused macro (LOOKUP_CONTINUE)
    ...

    Linus Torvalds
     

22 Jun, 2015

4 commits

  • At LSF we decided that if we truncate up from isize we shouldn't trim
    fallocated blocks that were fallocated with KEEP_SIZE and are past the
    new i_size. This patch fixes ext4 to do this.

    [ Completely reworked patch so that i_disksize would actually get set
    when truncating up. Also reworked the code for handling truncate so
    that it's easier to handle. -- tytso ]

    Signed-off-by: Josef Bacik
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Lukas Czerner

    Josef Bacik
     
  • Make the error reporting behavior resulting from the unsupported use
    of online defrag on files with data journaling enabled consistent with
    that implemented for bigalloc file systems. Difference found with
    ext4/308.

    Signed-off-by: Eric Whitney
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Darrick J. Wong

    Eric Whitney
     
  • Remove outdated comments and dead code from ext4_da_reserve_space.
    Clean up its trace point, and relocate it to make it more useful.

    While we're at it, fix a nearby conditional used to determine if
    we have a non-bigalloc file system. It doesn't match usage elsewhere
    in the code, and misleadingly suggests that an s_cluster_ratio value
    of 0 would be legal.

    Signed-off-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Eric Whitney
     
  • ext4 isn't willing to map clusters to a non-extent file. Don't signal
    this with an out of space error, since the FS will retry the
    allocation (which didn't fail) forever. Instead, return EUCLEAN so
    that the operation will fail immediately all the way back to userspace.

    (The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.)

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Darrick J. Wong
     

21 Jun, 2015

2 commits

  • In order to prevent quota block tracking from becoming inaccurate when
    ext4_quota_write() fails with ENOSPC, we make two changes. The quota
    file can now use the reserved blocks (since the quota file is arguably
    file system metadata), and ext4_quota_write() now uses
    ext4_should_retry_alloc() to retry the block allocation after a commit
    has completed and released some blocks for allocation.

    This fixes failures of xfstests generic/270:

    Quota error (device vdc): write_blk: dquota write failed
    Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Normally all of the buffers will have been forced out to disk before
    we call invalidate_bdev(), but there will be some cases, where a file
    system operation was aborted due to an ext4_error(), where there may
    still be some dirty buffers in the buffer cache for the device. So
    try to force them out to disk before calling invalidate_bdev().

    This fixes a warning triggered by generic/081:

    WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

16 Jun, 2015

1 commit

  • Several ext4_warning() messages in the directory handling code do not
    report the inode number of the (potentially corrupt) directory where a
    problem is seen, and others report this in an ad-hoc manner. Add an
    ext4_warning_inode() helper to print the inode number and command name
    consistently with ext4_error_inode().

    Consolidate the place in ext4.h that these macros are defined.

    Clean up some other directory error and warning messages to print the
    calling function name.

    Minor code style fixes in nearby lines.

    Signed-off-by: Andreas Dilger
    Signed-off-by: Theodore Ts'o

    Andreas Dilger
     

15 Jun, 2015

3 commits

  • Making a function call with 20 arguments is rather expensive in both
    stack and .text. In this case, doing the formatting manually doesn't
    make it any less readable, so we might as well save 155 bytes of .text
    and 112 bytes of stack.

    Signed-off-by: Rasmus Villemoes

    Rasmus Villemoes
     
  • Currently existing dio workers can jump in and potentially increase
    extent tree depth while we're allocating blocks in
    ext4_alloc_file_blocks(). This may cause us to underestimate the
    number of credits needed for the transaction because the extent tree
    depth can change after our estimation.

    Fix this by waiting for all the existing dio workers in the same way
    as we do it in ext4_punch_hole. We've seen errors caused by this in
    xfstest generic/299, however it's really hard to reproduce.

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o

    Lukas Czerner
     
  • Currently in ext4_alloc_file_blocks() the number of credits is
    calculated only once before we enter the allocation loop. However, within
    the allocation loop the extent tree depth can change, so the number
    of credits needed can increase, potentially exceeding the number of
    credits reserved in the handle, which can cause journal failures.

    Fix this by recalculating the number of credits when the inode depth
    changes. Note that even though ext4_alloc_file_blocks() is currently
    only used by extent-based inodes, we will avoid recalculating the number
    of credits unnecessarily in the case of indirect-based inodes.

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o

    Lukas Czerner
     

13 Jun, 2015

4 commits

  • Use kernel.h macro definition.

    Thanks to Julia Lawall for Coccinelle scripting support.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Theodore Ts'o

    Fabian Frederick
     
  • Use kernel.h macro definition.

    Thanks to Julia Lawall for Coccinelle scripting support.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Theodore Ts'o

    Fabian Frederick
     
  • The commit cf108bca465d: "ext4: Invert the locking order of page_lock
    and transaction start" caused __ext4_journalled_writepage() to drop
    the page lock before the page was written back, as part of changing
    the locking order to jbd2_journal_start -> page_lock. However, this
    introduced a potential race if there was a truncate racing with the
    data=journalled writeback mode.

    Fix this by grabbing the page lock after starting the journal handle,
    and then checking to see if the page had gotten truncated out from under
    us.

    This fixes a number of different warnings or BUG_ON's when running
    xfstests generic/086 in data=journalled mode, including:

    jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0

    - and -

    kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
    ...
    Call Trace:
    [] ? __ext4_journalled_invalidatepage+0x117/0x117
    [] __ext4_journalled_invalidatepage+0x10f/0x117
    [] ? __ext4_journalled_invalidatepage+0x117/0x117
    [] ? lock_buffer+0x36/0x36
    [] ext4_journalled_invalidatepage+0xd/0x22
    [] do_invalidatepage+0x22/0x26
    [] truncate_inode_page+0x5b/0x85
    [] truncate_inode_pages_range+0x156/0x38c
    [] truncate_inode_pages+0x11/0x15
    [] truncate_pagecache+0x55/0x71
    [] ext4_setattr+0x4a9/0x560
    [] ? current_kernel_time+0x10/0x44
    [] notify_change+0x1c7/0x2be
    [] do_truncate+0x65/0x85
    [] ? file_ra_state_init+0x12/0x29

    - and -

    WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396 jbd2_journal_dirty_metadata+0x14a/0x1ae()
    ...
    Call Trace:
    [] ? console_unlock+0x3a1/0x3ce
    [] dump_stack+0x48/0x60
    [] warn_slowpath_common+0x89/0xa0
    [] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
    [] warn_slowpath_null+0x14/0x18
    [] jbd2_journal_dirty_metadata+0x14a/0x1ae
    [] __ext4_handle_dirty_metadata+0xd4/0x19d
    [] write_end_fn+0x40/0x53
    [] ext4_walk_page_buffers+0x4e/0x6a
    [] ext4_writepage+0x354/0x3b8
    [] ? mpage_release_unused_pages+0xd4/0xd4
    [] ? wait_on_buffer+0x2c/0x2c
    [] ? ext4_writepage+0x3b8/0x3b8
    [] __writepage+0x10/0x2e
    [] write_cache_pages+0x22d/0x32c
    [] ? ext4_writepage+0x3b8/0x3b8
    [] ext4_writepages+0x102/0x607
    [] ? sched_clock_local+0x10/0x10e
    [] ? __lock_is_held+0x2e/0x44
    [] ? lock_is_held+0x43/0x51
    [] do_writepages+0x1c/0x29
    [] __writeback_single_inode+0xc3/0x545
    [] writeback_sb_inodes+0x21f/0x36d
    ...

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • We currently don't correctly handle the case where blocksize !=
    pagesize, so disallow the mount in those cases.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

09 Jun, 2015

2 commits

  • This patch implements fallocate's FALLOC_FL_INSERT_RANGE for Ext4.

    1) Make sure that both offset and len are block size aligned.
    2) Update the i_size of inode by len bytes.
    3) Compute the file's logical block number against offset. If the computed
    block number is not the starting block of the extent, split the extent
    such that the block number is the starting block of the extent.
    4) Shift all the extents which lie in [offset, last allocated extent]
    to the right by len bytes. This step will make a hole of len bytes
    at offset.
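
    The four steps above can be modelled on a plain list of extents. This is
    a simplified userspace sketch, not the ext4 implementation: extents are
    (logical block, block count, physical block) tuples, journalling and
    locking are ignored, and insert_range() is a name invented for the
    example.

```python
def insert_range(extents, i_size, offset, length, block_size=4096):
    """Apply the four steps to a sorted list of
    (logical_block, block_count, physical_block) extents.
    Returns (new_extents, new_i_size)."""
    # 1) offset and len must both be block size aligned.
    if offset % block_size or length % block_size:
        raise ValueError("EINVAL: offset and len must be block size aligned")

    # 2) The file grows by len bytes.
    new_size = i_size + length

    # 3) Split an extent that straddles the insertion block so that the
    #    insertion block becomes the first block of an extent.
    at = offset // block_size
    shift = length // block_size
    split = []
    for lblk, count, pblk in extents:
        if lblk < at < lblk + count:
            head = at - lblk
            split.append((lblk, head, pblk))
            split.append((at, count - head, pblk + head))
        else:
            split.append((lblk, count, pblk))

    # 4) Shift every extent at or after the insertion point to the right,
    #    which leaves a hole of `shift` blocks at `at`.
    shifted = [(l + shift if l >= at else l, c, p) for l, c, p in split]
    return shifted, new_size


BLK = 4096
result = insert_range([(0, 4, 100), (8, 2, 200)], 10 * BLK, 2 * BLK, 3 * BLK)
print(result)
```

    Note that only logical block numbers move; the physical blocks stay where
    they are, which is why the operation does not need to copy data.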

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan

    Namjae Jeon
     
  • [ Added another sparse fix for EXT4_IOC_GET_ENCRYPTION_POLICY while
    we're at it. --tytso ]

    Signed-off-by: Fabian Frederick
    Signed-off-by: Theodore Ts'o

    Fabian Frederick
     

08 Jun, 2015

5 commits

  • During a source code review of fs/ext4/extents.c I noted identical
    consecutive lines. An assertion is repeated for inode1 and never done
    for inode2. This is not in keeping with the rest of the code in the
    ext4_swap_extents function and appears to be a bug.

    Assert that the inode2 mutex is not locked.

    Signed-off-by: David Moore
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Eric Sandeen

    David Moore
     
  • Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
    the allocation group is suitable for use or not. However we might get
    various errors and fail while initializing a new group, including -EIO,
    which would never get propagated up the call chain. This might lead to
    an endless loop at writeback when we're trying to find a good group to
    allocate from and we fail to initialize a new group (read error for
    example).

    Fix this by returning proper error code from ext4_mb_good_group() and
    using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
    we will always return only the first occurred error from
    ext4_mb_good_group() and we only propagate it back to the caller if we
    do not get any other errors and we fail to allocate any blocks.

    Note that with modes other than errors=continue, we will fail
    immediately in ext4_mb_good_group() in case of error. With
    errors=continue, however, we should try to continue using the file
    system, which is why we do not fail immediately when we see an error
    from ext4_mb_good_group(), but rather when we fail to find a suitable
    block group to allocate from due to a problem in group initialization.
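
    The propagation rule described above (remember only the first error, and
    report it only when nothing could be allocated) can be sketched as
    follows. The callbacks and the function name are hypothetical stand-ins
    for the mballoc internals, and the errno values are written as plain
    negative integers.

```python
EIO, ENOSPC = 5, 28


def regular_allocator(groups, good_group, try_allocate):
    """Scan `groups` in order and return (block, error).

    good_group(g)   -> 1 (usable), 0 (skip), or a negative errno
    try_allocate(g) -> an allocated block number, or None

    Only the first error from good_group() is remembered, and it is
    returned only if no group yields an allocation.
    """
    first_err = 0
    for g in groups:
        ret = good_group(g)
        if ret < 0:
            if first_err == 0:
                first_err = ret
            continue  # keep scanning, as with errors=continue
        if ret == 0:
            continue
        block = try_allocate(g)
        if block is not None:
            return block, 0
    return None, first_err


status = {0: -EIO, 1: 0, 2: 1}.get
print(regular_allocator([0, 1, 2], status, lambda g: 2000 + g))
print(regular_allocator([0, 1, 2], status, lambda g: None))
```

    Returning the error instead of 0 is what lets the caller stop retrying
    when every candidate group failed to initialize.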

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Darrick J. Wong

    Lukas Czerner
     
  • Currently on the machines with page size > block size when initializing
    block group buddy cache we initialize it for all the block group bitmaps
    in the page. However in the case of read error, checksum error, or if
    a single bitmap is in any way corrupted we would fail to initialize all
    of the bitmaps. This is problematic because we will not have access to
    the other allocation groups even though those might be perfectly fine
    and usable.

    Fix this by reading all the bitmaps instead of erroring out on the first
    problem, and simply skipping the bitmaps which were either not read
    properly or are not valid.
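
    In outline, the change is from "stop at the first bad bitmap" to "collect
    what is usable and note what is not". A minimal sketch of that pattern,
    with an invented reader callback and group numbering rather than the real
    buddy cache structures:

```python
def init_buddy_page(group_ids, read_bitmap):
    """Initialise every group whose bitmap shares one page.

    read_bitmap(g) returns the bitmap bytes or raises OSError.
    Returns ({group: bitmap}, {group: error}) so that one unreadable or
    invalid bitmap does not prevent the other groups from being used.
    """
    ready, failed = {}, {}
    for g in group_ids:
        try:
            ready[g] = read_bitmap(g)
        except OSError as err:
            failed[g] = err  # skip this group, keep going
    return ready, failed


def flaky_reader(g):
    """Example reader: group 1 is unreadable, the rest return a dummy bitmap."""
    if g == 1:
        raise OSError(5, "read error")
    return bytes([0xFF])


ready, failed = init_buddy_page([0, 1, 2, 3], flaky_reader)
print(sorted(ready), sorted(failed))
```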

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o

    Lukas Czerner
     
  • If we want to rely on the buffer_verified() flag of the block bitmap
    buffer, we have to set it consistently. However currently if we're
    initializing uninitialized block bitmap in
    ext4_read_block_bitmap_nowait() we're not going to set buffer verified
    at all.

    We can do this by simply setting the flag on the buffer, but I think
    it's actually better to run ext4_validate_block_bitmap() to make sure
    that what we did in the ext4_init_block_bitmap() is right.

    So run ext4_validate_block_bitmap() even after the block bitmap
    initialization. Also bail out early from ext4_validate_block_bitmap() if
    we see a corrupt bitmap, since we already know it's corrupt and we do not
    need to verify that.

    Signed-off-by: Lukas Czerner
    Signed-off-by: Theodore Ts'o

    Lukas Czerner
     

04 Jun, 2015

1 commit

  • dax_fault() currently relies on the get_block callback to attach an
    io completion callback to the mapping buffer head so that it can
    run unwritten extent conversion after zeroing allocated blocks.

    Instead of this hack, pass the conversion callback directly into
    dax_fault() similar to the get_block callback. When the filesystem
    allocates unwritten extents, it will set the buffer_unwritten()
    flag, and hence the dax_fault code can call the completion function
    in the contexts where it is necessary without overloading the
    mapping buffer head.

    Note: The changes to ext4 to use this interface are suspect at best.
    In fact, the way ext4 did this end_io assignment in the first place
    looks suspect because it only set a completion callback when there
    wasn't already some other write() call taking place on the same
    inode. The ext4 end_io code looks rather intricate and fragile with
    all its reference counting and passing to different contexts for
    modification via inode private pointers that aren't protected by
    locks...

    Signed-off-by: Dave Chinner
    Acked-by: Jan Kara
    Signed-off-by: Dave Chinner

    Dave Chinner